Effective Data Analysis
A Comprehensive Workflow with TTVM Process & R
呂奕

About
These slides provide introductory material to help you
improve your skills in manipulating data, modeling
efficiently, and gaining insight through that process.
The tool used in these slides is R, a popular open-source
tool that is both a statistical package and a programming
language.

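Do any of these problems sound familiar?
• You don't know what data is available to use
• The data you receive is as unreadable as hieroglyphics and impossible to organize
• When you finally decide to clean the data, you don't know where to start
• Cleaning by eyeballing gives no guarantee that nothing was missed or mistyped
• Once the data goes wrong there is no going back, and the folder fills up with "xxx_backup" files
• Once cleaning is finally done, the available analysis methods are very limited
• The next time the same task comes up, you start from scratch again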

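Do any of these problems sound familiar?
• Others can't follow how you processed the data, so collaboration is hard
• After choosing a method, carrying it out is painful; you end up digging through a thick stack of reference material
• Plotting is painful
• Modeling is painful
• You have a pile of models and don't know how to interpret them or choose among them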

Data analysis
is the process by which data becomes
understanding, knowledge and insight

http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?pagewanted=all

Why R?
• It is free
• It has a comprehensive set of packages
    • Data access
    • Data cleaning
    • Analysis
    • Data reporting
• It has one of the best development environments:
  RStudio http://www.rstudio.com/
• It has an amazing ecosystem of developers
• Packages are easy to install and "play nicely together"

Why NOT SPSS?
Because the ideas covered next are hard to carry out in SPSS,
and SPSS is expensive

Before thinking outside the box
We have to look inside the black box and figure out how it
works.
Not until we understand the mechanism of (quantitative)
data analysis do we really master the (quantitative)
analysis skill.

Acquiring Data → Tidy → Transform / Visualize / Model → Interpret
Modified from Hadley Wickham

used to be…
Computation time >> Cognition time

https://www.flickr.com/photos/mutsmuts/4695658106
should be…
Cognition time >> Computation time

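Data in the real world…
usually comes from sources that have never been cleaned
is scattered across many places, in almost as many storage formats
has to be combined with other data sources before it yields useful information
streams in dynamically and keeps growing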

Data analysis
It is often said that 80% of data analysis is spent on
the process of cleaning and preparing the data
(Dasu and Johnson 2003).

Defining tidy data
Like families, tidy datasets are all alike but every messy dataset is
messy in its own way.
1. Each variable forms a column.
2. Each observation forms a row.
3. Each type of observational unit forms a table.
This is Codd's 3rd normal form (Codd 1990)

Messy dataset
religion                 <$10k  $10-20k  $20-30k  $30-40k  $40-50k  $50-75k
Agnostic                    27       34       60       81       76      137
Atheist                     12       27       37       52       35       70
Buddhist                    27       21       30       34       33       58
Catholic                   418      617      732      670      638     1116
Don't know/refused          15       14       15       11       10       35
Evangelical Prot           575      869     1064      982      881     1486
Hindu                        1        9        7        9       11       34
Historically Black Prot    228      244      236      238      197      223
Jehovah's Witness           20       27       24       24       21       30
Jewish                      19       19       25       25       30       95

Tidy data
religion  income              freq
Agnostic  <$10k                 27
Agnostic  $10-20k               34
Agnostic  $20-30k               60
Agnostic  $30-40k               81
Agnostic  $40-50k               76
Agnostic  $50-75k              137
Agnostic  >150k                 84
Agnostic  Don't know/refused    96
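
The step from the messy table to the tidy one can be sketched in base R. This is only an illustration under assumptions: it uses just three rows and three income columns of the messy table, the object names (messy, income_cols, tidy) are made up for this example, and base R is used to avoid extra packages. It is not necessarily the approach the slides' author used.

```r
# A small slice of the messy (wide) table: one column per income bracket.
messy <- data.frame(
  religion  = c("Agnostic", "Atheist", "Buddhist"),
  `<$10k`   = c(27, 12, 27),
  `$10-20k` = c(34, 27, 21),
  `$20-30k` = c(60, 37, 30),
  check.names = FALSE,        # keep the bracket labels as column names
  stringsAsFactors = FALSE
)

# Every column except the identifier holds values of the variable "income".
income_cols <- setdiff(names(messy), "religion")

# Stack the income columns so that each (religion, income) pair is one row.
tidy <- data.frame(
  religion = rep(messy$religion, times = length(income_cols)),
  income   = rep(income_cols, each = nrow(messy)),
  freq     = unlist(messy[income_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Group each religion's rows together, as in the tidy table.
tidy <- tidy[order(tidy$religion), ]
rownames(tidy) <- NULL

print(tidy)
```

After this, each variable (religion, income, freq) is a column and each observation is a row, which matches the first two rules of tidy data.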

If you can get something done in one pass,
then don't spend
dozens!

Split, Apply, Combine

Original data:
name  n
Al    2
Bo    4
Bo    0
Bo    5
Ed    5
Ed    10

Split (by name):
name  n      name  n      name  n
Al    2      Bo    4      Ed    5
             Bo    0      Ed    10
             Bo    5

Apply (sum of n):
total        total        total
2            9            15

Combine:
name  n
Al    2
Bo    9
Ed    15
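
The split, apply, combine pattern on this Al/Bo/Ed data can be sketched in base R as follows. The object names (scores, pieces, totals, result) are assumptions made for this example, and sum is used as the "apply" step because it reproduces the totals 2, 9, and 15 shown on the slide.

```r
# The example data from the slide.
scores <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10),
  stringsAsFactors = FALSE
)

# Split: one vector of n values per name.
pieces <- split(scores$n, scores$name)

# Apply: compute the total of each piece.
totals <- sapply(pieces, sum)

# Combine: put the per-group results back into a single data frame.
result <- data.frame(
  name = names(totals),
  n    = unname(totals),
  stringsAsFactors = FALSE
)

print(result)
```

The same result can be obtained in one call with aggregate(n ~ name, data = scores, FUN = sum); the three explicit steps are spelled out here only to mirror the diagram.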

Simple and comprehensible code
makes your work replicable and easy to debug
Most importantly: you never have to scrap everything and start over; when something goes wrong, you can simply rerun it

How does the data need to be prepared
before it can be plotted?

FA and clustering seem hard

Flexible / Learning by doing

Plenty of packages that others have generously written and shared

A coherent, traceable workflow makes problems easy to spot
http://www.slideshare.net/ckliu/z-b-38495724 | http://gene.speaking.tw/2014/10/28.html

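A brief word on big data
Never mind how big it is. Do you know how the analysis methods differ?
→ In fact they hardly differ: hypothesize, validate, predict (learn)
Besides, any file that still fits on your hard drive basically doesn't count as big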

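A brief word on big data
Big data is not a new concept; it is simply a topic that has been inexplicably hyped in recent years.
Big data really is a management problem. Many companies still treat it as data mining and just
run analyses on their transaction data, and even more do not believe in data at all. Far too much
unstructured data and machine data goes unused, and even open structured data is not being used.
Then there is the question of how far a company takes data-driven decisions: is it only marketing
that does it? Strategically accumulating your data, training your models, and the degree of your
data automation will all become competitive advantages over your rivals. Applications built by
combining big data, machine learning, and the cloud will largely replace many past business models.
http://www.bnext.com.tw/article/view/id/34692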
