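25.
One-sample t-test I
Goal: test whether the mean of Science equals 60.
Basic syntax of t.test()
t.test(data, alternative = "t" or "l" or "g",
       mu = hypothesized mean, ...)
# "t", "l" and "g" are abbreviations of "two.sided", "less" and "greater"
> # Two-sided:
> t.test(dt$Science, alternative = "t", mu = 60)
> # Right-tailed:
> t.test(dt$Science, alternative = "g", mu = 60)
> # Left-tailed:
> t.test(dt$Science, alternative = "l", mu = 60)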
26.
One-sample t-test II
> t.test(dt$Science, mu = 60)
One Sample t-test
data: dt$Science
t = 1.5393, df = 8, p-value = 0.1623
alternative hypothesis: true mean is not equal to 60
95 percent confidence interval:
54.63219 86.92336
sample estimates:
mean of x
70.77778
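To make the numbers in this output less of a black box, the following is a minimal sketch that recomputes the t statistic, the two-sided p-value and the 95% confidence interval by hand and compares them with t.test(). The `science` vector is made-up illustration data, not the course's `dt` data set, so the values will differ from the slide.

```r
# Hypothetical scores (NOT the course's dt data set)
science <- c(52, 61, 88, 45, 93, 70, 66, 79, 83)
mu0 <- 60                                   # hypothesized mean

n       <- length(science)
std_err <- sd(science) / sqrt(n)            # standard error of the mean
t_stat  <- (mean(science) - mu0) / std_err  # t = (xbar - mu0) / SE
df      <- n - 1
p_val   <- 2 * pt(-abs(t_stat), df)         # two-sided p-value
ci      <- mean(science) + c(-1, 1) * qt(0.975, df) * std_err

res <- t.test(science, mu = mu0)            # the same test via t.test()

# Each comparison should return TRUE
all.equal(unname(res$statistic), t_stat)
all.equal(res$p.value, p_val)
all.equal(as.numeric(res$conf.int), ci)
```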
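27.
Paired-sample t-test I
Goal: test whether the mean of the differences between Literature and Science equals 0.
Basic syntax of t.test()
t.test(data1, data2,
       alternative = "t" or "l" or "g",
       mu = hypothesized mean of the paired differences, paired = TRUE, ...)
> # Two-sided by default; the hypothesized mean difference defaults to zero
> t.test(dt$Literature, dt$Science, paired = TRUE)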
28.
Paired-sample t-test II
> t.test(dt$Literature, dt$Science, paired = TRUE)
Paired t-test
data: dt$Literature and dt$Science
t = -4.2126, df = 8, p-value = 0.002945
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-17.193365 -5.028857
sample estimates:
mean of the differences
-11.11111
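A paired t-test is the same as a one-sample t-test on the per-subject differences, which is why the output reports a "mean of the differences". The sketch below illustrates this equivalence; both vectors are made-up illustration data, not the course's `dt` data set.

```r
# Hypothetical paired scores for nine students (NOT the course's dt data set)
literature <- c(48, 55, 80, 30, 85, 62, 51, 70, 77)
science    <- c(52, 61, 88, 45, 93, 70, 66, 79, 83)

# Paired test on the two columns
paired_res <- t.test(literature, science, paired = TRUE)

# One-sample test on the differences, against a mean of zero
diff_res <- t.test(literature - science, mu = 0)

# Both comparisons should return TRUE
all.equal(unname(paired_res$statistic), unname(diff_res$statistic))
all.equal(paired_res$p.value, diff_res$p.value)
```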
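29.
Independent two-sample t-test I
Goal: test whether the mean of Literature is equal between the two Gender groups.
Basic syntax of t.test()
t.test(data1, data2, mu = hypothesized difference in means,
       alternative = "t" or "l" or "g",
       var.equal = TRUE or FALSE, ...)
t.test(response ~ two-level factor,
       data = data frame, ...)
> # This form estimates m - f; the formula form below estimates f - m,
> # so the sign of t and of the confidence interval is flipped.
> t.test(subset(dt, Gender == "m")$Literature,
+        subset(dt, Gender == "f")$Literature,
+        var.equal = TRUE)
> t.test(Literature ~ Gender, data = dt, var.equal = TRUE)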
30.
Independent two-sample t-test II
> t.test(Literature ~ Gender, data = dt, var.equal = TRUE)
Two Sample t-test
data: Literature by Gender
t = -0.8823, df = 7, p-value = 0.4069
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-43.60845 19.90845
sample estimates:
mean in group f mean in group m
54.40 66.25
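With var.equal = TRUE the test uses a pooled variance estimate and n1 + n2 - 2 degrees of freedom. As a rough sketch, the statistic can be rebuilt by hand as below. The `dt_demo` data frame is made-up illustration data (five "f" and four "m" rows, chosen only so that the degrees of freedom match the slide); it is not the course's `dt`.

```r
# Hypothetical data (NOT the course's dt data set)
dt_demo <- data.frame(
  Gender     = factor(c("f", "f", "f", "f", "f", "m", "m", "m", "m")),
  Literature = c(48, 55, 80, 30, 59, 62, 51, 70, 82)
)
f <- dt_demo$Literature[dt_demo$Gender == "f"]
m <- dt_demo$Literature[dt_demo$Gender == "m"]

# Pooled variance: weighted average of the two group variances
sp2 <- ((length(f) - 1) * var(f) + (length(m) - 1) * var(m)) /
  (length(f) + length(m) - 2)
t_stat <- (mean(f) - mean(m)) / sqrt(sp2 * (1 / length(f) + 1 / length(m)))

res_formula <- t.test(Literature ~ Gender, data = dt_demo, var.equal = TRUE)
res_mf      <- t.test(m, f, var.equal = TRUE)   # groups in the opposite order

all.equal(unname(res_formula$statistic), t_stat)  # formula form is f - m
all.equal(unname(res_mf$statistic), -t_stat)      # m - f flips the sign
unname(res_formula$parameter)                     # df = 5 + 4 - 2
```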
31.
Box plot
Basic syntax of boxplot()
boxplot(response ~ factor, data = data frame, ...)
> boxplot(Literature ~ Gender, data = dt,
+         ylab = "Literature score", xlab = "Gender")
[Figure: box plots of Literature score (y-axis, ticks at 30, 50, 70, 90) by Gender (x-axis: f, m)]
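32.
One-way ANOVA I
Goal: test whether the mean of Literature is equal across the three Group levels, and run Tukey's post hoc test.
Basic syntax of aov() and TukeyHSD()
aov(response ~ factor with three or more levels,
    data = data frame, ...)
TukeyHSD(aov object, "grouping factor", ...)
> fit.1 <- aov(Literature ~ Group, data = dt)
> summary(fit.1) # Type I (sequential) sums of squares
> TukeyHSD(fit.1, "Group")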
33.
One-way ANOVA II
> fit.1 <- aov(Literature ~ Group, data = dt)
> summary(fit.1)
Df Sum Sq Mean Sq F value Pr(>F)
Group 2 2.7 1.3 0.003 0.997
Residuals 6 3115.3 519.2
> TukeyHSD(fit.1, "Group")
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = Literature ~ Group, data = dt)
$Group
diff lwr upr p adj
B-A 0.6666667 -56.41875 57.75209 0.9992924
C-A 1.3333333 -55.75209 58.41875 0.9971738
C-B 0.6666667 -56.41875 57.75209 0.9992924
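The Sum Sq, Mean Sq and F value columns in the ANOVA table come from splitting the total variation into a between-group and a within-group part. The sketch below rebuilds the F test by hand; `dt_demo` is made-up illustration data with three groups of three, not the course's `dt`.

```r
# Hypothetical data (NOT the course's dt data set)
dt_demo <- data.frame(
  Group      = factor(rep(c("A", "B", "C"), each = 3)),
  Literature = c(48, 55, 80, 30, 85, 62, 51, 70, 77)
)

grand_mean   <- mean(dt_demo$Literature)
fitted_means <- ave(dt_demo$Literature, dt_demo$Group)  # group mean per row

ss_between <- sum((fitted_means - grand_mean)^2)            # "Group" Sum Sq
ss_within  <- sum((dt_demo$Literature - fitted_means)^2)    # "Residuals" Sum Sq
df_between <- nlevels(dt_demo$Group) - 1
df_within  <- nrow(dt_demo) - nlevels(dt_demo$Group)

f_stat <- (ss_between / df_between) / (ss_within / df_within)
p_val  <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

fit <- aov(Literature ~ Group, data = dt_demo)
tab <- summary(fit)[[1]]                 # the ANOVA table as a data frame

all.equal(tab[["F value"]][1], f_stat)   # should be TRUE
all.equal(tab[["Pr(>F)"]][1], p_val)     # should be TRUE
```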
34.
Box plot
Basic syntax of boxplot()
boxplot(response ~ factor, data = data frame, ...)
> boxplot(Literature ~ Group, data = dt,
+         ylab = "Literature score", xlab = "Group")
[Figure: box plots of Literature score (y-axis, ticks at 30, 50, 70, 90) by Group (x-axis: A, B, C)]
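35.
Simple linear regression I
Goal: fit a simple linear regression model with Literature as the response and Science as the predictor, and test whether the slope is zero.
Basic syntax of lm()
lm(response ~ continuous predictor, data = data frame, ...)
> fit.2 <- lm(Literature ~ Science, data = dt)
> summary(fit.2)
> anova(fit.2) # Type I (sequential) sums of squares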
36.
Simple linear regression II
> fit.2 <- lm(Literature ~ Science, data = dt)
> summary(fit.2)
Call:
lm(formula = Literature ~ Science, data = dt)
Residuals:
Min 1Q Median 3Q Max
-16.894 -1.085 2.494 4.269 8.113
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9625 9.8294 -0.200 0.847422
Science 0.8707 0.1337 6.511 0.000331 ***
---
Residual standard error: 7.946 on 7 degrees of freedom
Multiple R-squared: 0.8583, Adjusted R-squared: 0.838
F-statistic: 42.39 on 1 and 7 DF, p-value: 0.0003308
37.
Simple linear regression III
> anova(fit.2)
Analysis of Variance Table
Response: Literature
Df Sum Sq Mean Sq F value Pr(>F)
Science 1 2676.08 2676.08 42.389 0.0003308 ***
Residuals 7 441.92 63.13
38.
Simple linear correlation I
Goal: compute the simple linear (Pearson) correlation coefficient between Science and Literature and test whether it is zero.
Basic syntax of cor.test()
cor.test(data1, data2,
         alternative = "t" or "l" or "g", ...)
cor.test(~ data1 + data2, data = data frame, ...)
> cor.test(dt$Literature, dt$Science)
> cor.test(~ Literature + Science, data = dt)
> cor.test(~ Science + Literature, data = dt)
39.
Simple linear correlation II
> cor.test(dt$Literature, dt$Science)
Pearson's product-moment correlation
data: dt$Literature and dt$Science
t = 6.5107, df = 7, p-value = 0.0003308
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.6817766 0.9847014
sample estimates:
cor
0.9264278
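The regression and correlation outputs share the same t value and p-value, and the squared correlation equals the regression's Multiple R-squared. This is not a coincidence: with a single predictor, the test of the slope and the test of the Pearson correlation are the same test. A minimal sketch of that link, using made-up data rather than the course's `dt`:

```r
# Hypothetical data (NOT the course's dt data set)
dt_demo <- data.frame(
  Literature = c(48, 55, 80, 30, 85, 62, 51, 70, 77),
  Science    = c(52, 61, 88, 45, 93, 70, 66, 79, 83)
)

fit <- lm(Literature ~ Science, data = dt_demo)
ct  <- cor.test(~ Literature + Science, data = dt_demo)

slope_t <- coef(summary(fit))["Science", "t value"]

# Each comparison should return TRUE
all.equal(slope_t, unname(ct$statistic))                  # same t statistic
all.equal(summary(fit)$r.squared, unname(ct$estimate)^2)  # R-squared = r^2
all.equal(anova(fit)[["F value"]][1], slope_t^2)          # F = t^2
```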
40.
Scatter plot I
Basic syntax of coef()
coef(lm object, ...) # extract the regression coefficients
Basic syntax of plot.formula() and abline()
plot(y-axis data ~ x-axis data, data = data frame, ...)
abline(a = coef(regression object)[1],
       b = coef(regression object)[2],
       lty, col, ...) # add the regression line
> plot(Literature ~ Science, data = dt)
> abline(a = coef(fit.2)[1], b = coef(fit.2)[2], lty = 3)
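As a shorter alternative, abline() also accepts a fitted lm object directly, and predict() returns the fitted line's value at new x positions. The sketch below uses made-up data and a hypothetical `new_science` data frame; neither comes from the course materials.

```r
# Hypothetical data (NOT the course's dt data set)
dt_demo <- data.frame(
  Literature = c(48, 55, 80, 30, 85, 62, 51, 70, 77),
  Science    = c(52, 61, 88, 45, 93, 70, 66, 79, 83)
)
fit <- lm(Literature ~ Science, data = dt_demo)

plot(Literature ~ Science, data = dt_demo)
abline(fit, lty = 3)      # same line as abline(a = coef(fit)[1], b = coef(fit)[2])

# Predicted Literature scores at two new Science scores
new_science <- data.frame(Science = c(50, 75))
pred <- predict(fit, newdata = new_science)

# The predictions are intercept + slope * x
by_hand <- coef(fit)[1] + coef(fit)[2] * new_science$Science
all.equal(unname(pred), unname(by_hand))   # should be TRUE
```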
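46.
Recommended English-language books
There are a great many English-language books to choose from. Here are a few that I like or consider worth reading.
- "Biostatistical Design and Analysis Using R: A Practical Guide" by Murray Logan. Wiley-Blackwell Press.
  Gives equal weight to experimental design and R; highly recommended.
- "The R Book, 2nd Edition" by Michael J. Crawley. Wiley Press.
  Harder to read, but still worth close study. Gives equal weight to the R language and statistics.
- "A First Course in Statistical Programming with R" by W. John Braun & Duncan J. Murdoch. Cambridge University Press.
  Easy to read. Mainly covers basic statistics, with little on experimental design.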
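47.
Online tutorials
- 《R 演習室》 on youtube.com (https://www.youtube.com/playlist?list=PL5AC0ADBF65924EAD)
  An R video tutorial series for beginners. It has ads, but download links for the videos are provided.
- http://www.r-software.org/home
  中華 R 軟體學會 (an R software society). Hosts many Chinese-language videos and tutorials; the content is rich and also suitable for beginners.
- "Quick-R" by Robert I. Kabacoff (http://www.statmethods.net/)
  The quick-reference site I use most often.
- There are a great many English-language online tutorials; search for "R tutorial" yourself.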
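51.
Q&A time again
Q: How do I find a package that can do a particular task?
A: Asking Google is the fastest way. Really.
Q: How long did 阿盤 study before being "proficient" and "productive"?
A: More than half a year of self-study, but today I am passing on 80% of what I know to you!
Q: Having heard all this, I want to give up... I want to go back to the world where everything is done with a mouse.
A: Any tool that suits you is a good tool.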
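52.
Today's recap
- Set up an R working environment (one that suits you)
- Understand R functions and how to read their help pages
- How R reads in and tidies data
- Practice common statistical methods
- Resources for becoming more skilled
> cat("Have wonderful R experiences!\n")
> q()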
60.
Reference solution V
Independent two-sample t-test of HIV.rate grouped by GDP.10000:
> t.test(HIV.rate ~ GDP.10000,
+        data = mydt0, var.equal = TRUE)
Two Sample t-test
data: HIV.rate by GDP.10000
t = -1.6351, df = 70, p-value = 0.1065
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.0576478 0.2036993
sample estimates:
mean in group high mean in group low
0.286087 1.213061
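Note: for this example, t.test(..., var.equal = FALSE) may be more appropriate (because the variances of the two groups differ considerably). One could even consider nonparametric methods such as the two-sample Wilcoxon test, wilcox.test(), or the two-sample Kolmogorov-Smirnov test, ks.test().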
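The alternatives mentioned in the note can be sketched as follows. The `demo` data frame and its `group` and `rate` columns are made-up stand-ins for `mydt0`, `GDP.10000` and `HIV.rate`, with one group deliberately given a much larger spread; the results say nothing about the real data.

```r
# Hypothetical data (NOT the course's mydt0 data set); no tied values
demo <- data.frame(
  group = factor(rep(c("high", "low"), times = c(6, 8))),
  rate  = c(0.10, 0.22, 0.15, 0.41, 0.33, 0.62,
            0.25, 0.55, 3.10, 0.95, 0.12, 2.40, 1.60, 0.70)
)
high <- demo$rate[demo$group == "high"]
low  <- demo$rate[demo$group == "low"]

# Inspect the group variances before assuming they are equal
group_var <- tapply(demo$rate, demo$group, var)
group_var

# Welch's t-test: does not assume equal variances
welch <- t.test(rate ~ group, data = demo, var.equal = FALSE)
welch

# Nonparametric alternatives
wil <- wilcox.test(rate ~ group, data = demo)   # two-sample Wilcoxon test
ks  <- ks.test(high, low)                       # two-sample Kolmogorov-Smirnov test
wil
ks
```

Welch's test usually reports fractional degrees of freedom, which never exceed the n1 + n2 - 2 used by the pooled test.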