This document provides an overview of effective data analysis using R. It discusses common challenges with data preparation and introduces the TTVM process for data analysis (Tidy, Transform, Visualize, Model), followed by interpretation of the results. The document explains why R is a useful tool for data analysis, citing its packages for data access, cleaning, analysis, and reporting. It also emphasizes that most of the work in data analysis involves cleaning and preparing the data before analysis or modeling can begin.
2. About
These slides provide introductory material to help
improve skills for manipulating data, modeling it
efficiently, and drawing insights from that process.
The tool used in these slides is R, popular open-source
software that is both a statistical package and a
programming language.
9. Why R?
• It is free
• It has a comprehensive set of packages
    • Data access
    • Data cleaning
    • Analysis
    • Data reporting
• It has one of the best development environments:
  RStudio (http://www.rstudio.com/)
• It has an amazing ecosystem of developers
• Packages are easy to install and "play nicely together"
12. Before thinking outside the box
We have to look inside the black box and figure out how it
works.
Only when we understand the mechanics of (quantitative)
data analysis can we really master the (quantitative)
analysis skill.
19. Data analysis
It is often said that 80% of data analysis is spent on
the process of cleaning and preparing the data
(Dasu and Johnson 2003).
20. Defining tidy data
Like families, tidy datasets are all alike, but every messy dataset is
messy in its own way. In tidy data:
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
This is Codd's 3rd normal form (Codd 1990)
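
The three rules can be illustrated with a small sketch in base R. The table below is made up for illustration: the column names treatment_a and treatment_b and all the values are assumptions, not data from these slides. It starts from a "messy" layout, where one variable (the treatment) is spread across column headers, and rebuilds it so that each variable is a column and each observation is a row.

```r
# Hypothetical messy table: one row per person, one column per treatment.
# The treatment is a variable, but here it is hidden in the column names.
messy <- data.frame(
  name        = c("Al", "Bo", "Ed"),
  treatment_a = c(2, 4, 5),
  treatment_b = c(3, 0, 10),
  stringsAsFactors = FALSE
)

# Tidy version: one row per (name, treatment) observation,
# one column per variable (name, treatment, result).
value_cols <- c("treatment_a", "treatment_b")
tidy <- data.frame(
  name      = rep(messy$name, times = length(value_cols)),
  treatment = rep(sub("^treatment_", "", value_cols), each = nrow(messy)),
  result    = unlist(messy[value_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(tidy)
```

Add-on packages offer more convenient reshaping functions; base R is used here only so the sketch runs without installing anything.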
26. Split, Apply, Combine
Input:
name  n
Al    2
Bo    4
Bo    0
Bo    5
Ed    5
Ed    10

Split (by name):
name  n        name  n        name  n
Al    2        Bo    4        Ed    5
               Bo    0        Ed    10
               Bo    5

Apply (sum of n):
total          total          total
2              9              15

Combine:
name  n
Al    2
Bo    9
Ed    15
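
The same example can be written out step by step in base R. This is a minimal sketch rather than the method used in the original slides; the variable names (df, pieces, totals, result) are chosen here for illustration, and sum is assumed as the "apply" function because it reproduces the totals shown above.

```r
# The input table from the slide.
df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10),
  stringsAsFactors = FALSE
)

# Split: one small data frame per name.
pieces <- split(df, df$name)

# Apply: sum n within each piece.
totals <- lapply(pieces, function(piece) sum(piece$n))

# Combine: put the per-group results back into a single table.
result <- data.frame(
  name = names(totals),
  n    = unlist(totals, use.names = FALSE),
  stringsAsFactors = FALSE
)

# The same three steps in a single call with base R's aggregate().
result_short <- aggregate(n ~ name, data = df, FUN = sum)

print(result)
```

Writing the three steps separately makes the mechanism visible; in day-to-day work the one-line aggregate() form is usually easier to read.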
27. Simple and comprehensible code
makes your work replicable and easy to debug
Most importantly: you never have to scrap everything and start over;
if something goes wrong, you can simply rerun it.
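44. A brief look at big data
Big data is not a new concept; it is simply a topic that has been
hyped in recent years for no obvious reason.
Big data really is a management problem. Many companies still treat it
as data mining and just run a few analyses on their transaction data,
and even more do not believe in data at all. Far too much unstructured
data and machine data goes unused, and even open structured data is
not being used at all.
Then there is the question of how far a company takes data-driven
decision making: is it only done in marketing? Accumulating data
strategically, training your models, and the degree of your data
automation all become competitive advantages over your rivals.
Applications built by combining big data, machine learning, and the
cloud will largely replace many of the business models of the past.
http://www.bnext.com.tw/article/view/id/34692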