Techniques for Imputing Missing Data and Visual Analysis
2018.11.22.
Yun Jang (jangy@sejong.edu)
Data Visualization Lab
Sejong University
1
Background
2
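Machine learning process
[Figure: ML pipeline. Raw data → Data preprocessing → Prepared data → Apply algorithms → Candidate model → Chosen model → Application. Loop labels: "Iterate until data is ready", "Iterate for best model".]
• The goal of the machine learning process is to build a model.
• The model is trained on existing data and extracts information from new data.
• The application then obtains and uses that information.
• So that the process yields a sound model, the data scientist decides on steps such as data preprocessing, algorithm selection, and model selection.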
3
Machine learning process
• The machine learning process starts with a question.
• What do we want to achieve with machine learning?
• For example, predicting the probability of an event, or building a recommendation algorithm.
4
Machine learning process
• The next step is to derive the prepared data.
• Rather than using the given raw data as is, the data is transformed into a form suited to analysis.
• That is, the data is converted to fit the goal to be solved with machine learning and the algorithm to be applied.
5
Machine learning process
• Once the prepared data is available, apply a machine learning algorithm suited to the goal.
• Algorithms are applied and tested repeatedly until the desired model is produced.
6
Problem
[Figure: the same ML pipeline, with missing data present in the raw data (marked X).]
• If the raw data contains missing values, what problems arise in the ML process?
- Prepared data: the generated data does not match the original characteristics.
- Model: the outcome becomes unpredictable, and overfitting or underfitting occurs.
- Application: wrong recommendations become possible.
7
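What should we do?
[Figure: Missing data → Imputation → Complete data, which then feeds Data preprocessing → Prepared data → Apply ML algorithms → Candidate model → Chosen model → Application. Loop label: "Iterate for best model".]
• An imputation model must be used so that complete data can be estimated from the missing data.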
8
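Imputation
• Imputation is the process of replacing missing data with substitute values.
• The imputation model should be chosen according to the characteristics of the missing data.
[Figure: Missing data → Imputation → Complete data. Workflow: analyze the characteristics of the missing data → select an imputation model.]
Imputation approaches:
• Listwise deletion
• Single imputation
- Hot-deck
- Cold-deck
- Mean substitution
- Interpolation
• Multiple imputation
• Model-based approach
• ...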
9
Imputation
• To choose a suitable imputation model, the characteristics of the missing data must be understood first.
10
Missing data
11
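Missing data?
• Missing data occur when no data value is stored for a variable in an observation [1].
• Missing data are values that, for whatever reason, were not recorded and are therefore absent from the data set.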
12
1. Graham, John W. "Missing data analysis: Making it work in the real world." Annual review of psychology 60 (2009): 549-576.
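Effects of missing data?
• Increased variance leads to biased analysis results [1].
• Statistical analysis results can be distorted.
• Statistical significance has to be taken into account.
Iris data [2]
(a) No missing data: average petal length 3.113
(b) 33% of the petal length values missing at random: average petal length 3.735
(c) The smallest 33% of the petal length values missing: average petal length 4.906
• The characteristics of the data change depending on the missingness condition.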
13
1. Stuart, Elizabeth A., et al. "Multiple imputation with large data sets: a case study of the Children's Mental Health Initiative." American journal of epidemiology 169.9 (2009)
2. Fernstad, Sara Johansson. "To identify what is not there: A definition of missingness patterns and evaluation of missing value visualization." Information Visualization (2018)
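Summary
• There is no single established best way to handle missing values.
• When variables contain missing values, the data no longer reflect the true characteristics of the population, so analysis results are distorted and statistical power is reduced.
• The effect on the analysis results differs depending on how the missing values are handled.
• Nevertheless, given how strongly missing values in the variables affect the analysis results, handling them appropriately is important.
• The analyst therefore has to analyze the characteristics of the missing data and find a suitable imputation method.
• Information needed before imputing missing data:
- why the data are missing
- the share of missing values in the whole data set
- the missing type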
14
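Why does missingness occur?
• Surveys: a respondent takes part in the survey but does not answer some of the questions.
• Clinical trials: a subject drops out partway through the trial for personal reasons.
• Data joins: ambiguous join conditions cause data to be merged incorrectly.
• Data collection: problems with the collection conditions cause data to be lost.
• There can be many other reasons for missing data besides these.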
15
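Missing type
• The way missing values should be handled depends on the missing type.
• Missing types fall into three broad categories [1]:
- MCAR (Missing completely at random): values are missing completely at random
- MAR (Missing at random): values are missing at random
- NMAR (Not missing at random, also written MNAR): values are not missing at random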
16
1. Little, Roderick JA, and Donald B. Rubin. Statistical analysis with missing data. Vol. 333. John Wiley & Sons, 2014.
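Before looking at the characteristics of each missing type
• $y_{i,j}$: the j-th observation of the i-th subject
• $y_{i,obs}$: the data made up of the observed values of the i-th subject
• $y_{i,mis}$: the unobserved (missing) data of the i-th subject
• Example: participant i visits a hospital over 6 months to have a certain value measured, and the observations for months 1, 2, 3 are 100, 105, 110:
$y_{i,1} = 100,\ y_{i,2} = 105,\ y_{i,3} = 110 \Rightarrow y_{i,obs} = (100, 105, 110)$
• If participant i stops participating from month 4, the observations for months 4 to 6 are missing:
$y_{i,4} = ?,\ y_{i,5} = ?,\ y_{i,6} = ? \Rightarrow y_{i,mis} = (?, ?, ?)$
• $r_i$: indicator marking which entries of subject i's data are missing (1 = missing, 0 = observed)
$r_i = (0, 0, 0, 1, 1, 1)$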
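The notation above can be sketched in a few lines of Python. This is only an illustration: representing a missing value as `None` and the helper names `missing_indicator` and `split_observed` are assumptions made here, not part of the slides.

```python
from typing import List, Optional

def missing_indicator(values: List[Optional[float]]) -> List[int]:
    """Build the indicator vector: 1 = missing (None), 0 = observed."""
    return [1 if v is None else 0 for v in values]

def split_observed(values: List[Optional[float]]):
    """Return the observed values and the number of missing entries."""
    observed = [v for v in values if v is not None]
    n_missing = len(values) - len(observed)
    return observed, n_missing

# Months 1 to 3 were measured, months 4 to 6 are missing (the slide's example).
y_i = [100, 105, 110, None, None, None]
r_i = missing_indicator(y_i)
y_obs, n_mis = split_observed(y_i)

print(r_i)    # [0, 0, 0, 1, 1, 1]
print(y_obs)  # [100, 105, 110]
```

Keeping the indicator separate from the data is convenient because the missing types discussed next are statements about how this indicator relates to the observed and the missing values.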
17
Source: statistics lecture (statistical analysis), Prof. Seungho Kang, Yonsei University
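MCAR (Missing completely at random)
• If $r_i$ is independent of both $y_{i,obs}$ and $y_{i,mis}$, the missing pattern of the i-th subject's observations is classified as MCAR.
• In other words, where the missingness occurs depends on neither the observed values $y_{i,obs}$ nor the missing values $y_{i,mis}$.
• For example:
- A person takes part in a weight-loss experiment from January to June.
- From January to March the weight is measured at the hospital.
- In April the person has to travel to their hometown, so the weight cannot be measured at the hospital.
- In May and June the weight is measured at the hospital again.
• In this case the missing value in April is unrelated to $y_{i,obs}$ and $y_{i,mis}$, so it can be regarded as MCAR.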
18
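MAR (Missing at random)
• If $r_i$ depends on $y_{i,obs}$ but is independent of $y_{i,mis}$, the missing pattern of the i-th subject's observations is classified as MAR.
• In other words, where the missingness occurs is related only to the observed values $y_{i,obs}$.
• For example:
- A person takes part in a weight-loss experiment from January to June.
- The weight measured during the experiment does not come down as much as hoped.
- Discouraged by the measured weight, the person does not visit the hospital in March.
- Afterwards the person resolves to continue and rejoins the experiment.
• In this case the missing value in March depends on the previously observed values $y_{i,obs}$ but is unrelated to $y_{i,mis}$, so it can be regarded as MAR.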
19
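MNAR (Missing not at random)
• If $r_i$ depends on both $y_{i,obs}$ and $y_{i,mis}$, the missing pattern of the i-th subject's observations is classified as MNAR.
• In other words, where the missingness occurs is related to both the observed values $y_{i,obs}$ and the missing values $y_{i,mis}$.
• For example:
- A person takes part in a weight-loss experiment from January to June.
- From January to March the person loses weight.
- Before the April hospital visit, the person weighs themselves at home and finds the weight is back at the January measurement.
- The person therefore does not visit the hospital in April.
• In this case the missing value in April depends not only on the values observed from January to March, $y_{i,obs}$, but also on the value we could not observe, $y_{i,mis}$, so it can be regarded as MNAR.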
20
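Missing types in summary
• The missing type describes the process by which the missingness arose.
• Missing data can be classified into three missing types:
- MCAR (missing completely at random), MAR (missing at random), MNAR (missing not at random)
• The missing type is determined by whether the occurrence of missingness is related to the observed values and to the missing values:
- MCAR: the occurrence of missingness is related to neither the observed nor the missing values
- MAR: the occurrence of missingness is related only to the observed values
- MNAR: the occurrence of missingness is related to both the observed and the missing values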
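The three mechanisms can be illustrated with a small simulation. This is a toy sketch under assumptions made here, not taken from the slides: `x` is a fully observed variable, `y = x + noise` is the variable that may go missing, and the missingness probabilities (0.3 and 0.6) are arbitrary.

```python
import random

def simulate(mechanism: str, n: int = 10000, seed: int = 0):
    """Generate (x, y, r) rows; r = 1 means y is missing."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)       # always observed
        y = x + rng.gauss(0, 1)   # may go missing; true mean is 0
        if mechanism == "MCAR":
            p = 0.3                    # ignores both x and y
        elif mechanism == "MAR":
            p = 0.6 if x > 0 else 0.0  # depends only on the observed x
        elif mechanism == "MNAR":
            p = 0.6 if y > 0 else 0.0  # depends on the value that goes missing
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        r = 1 if rng.random() < p else 0
        rows.append((x, y, r))
    return rows

def observed_mean(rows) -> float:
    """Mean of y over the rows where y was actually observed."""
    observed = [y for _, y, r in rows if r == 0]
    return sum(observed) / len(observed)

mcar_mean = observed_mean(simulate("MCAR"))
mar_mean = observed_mean(simulate("MAR"))
mnar_mean = observed_mean(simulate("MNAR"))
print(mcar_mean, mar_mean, mnar_mean)
```

In this setup the mean of the observed values stays close to the true mean of 0 only under MCAR. Under MAR it is shifted as well, but the shift is explained by the observed `x` and can therefore be corrected with a model of `y` given `x`; under MNAR the observed data alone do not contain that information.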
21
Dealing with missing data
22
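Methods for imputing missing data
[Diagram: choosing an imputation method by missing type]
• Missing type: MCAR, MAR, NMAR
• Missing data → Single imputation, Multiple imputation
• Single imputation methods: Mean, Regression, Stochastic regression, k-NN, Hot deck, Cold deck, Substitution, Deletion
• Diagram note: models for imputing the data with statistical methods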
23
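Mean imputation
• Replaces each missing value with the mean of the observed values.
• An underestimation problem can occur: the imputed variable shows less variability than the real data.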
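A minimal sketch of mean imputation, using made-up numbers and `None` as the missing marker (both are assumptions for illustration). It also shows the underestimation effect: the sample variance shrinks after imputation because every filled-in value sits exactly at the mean.

```python
from statistics import mean, variance

def mean_impute(values):
    """Replace every None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

data = [2.0, None, 4.0, 6.0, None, 8.0]   # hypothetical measurements
observed = [v for v in data if v is not None]
filled = mean_impute(data)

print(filled)                                 # [2.0, 5.0, 4.0, 6.0, 5.0, 8.0]
print(variance(observed), variance(filled))   # variance drops after imputation
```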
24
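Regression imputation
• Fits a regression model between variables using the observed values, then estimates the missing values from that model.

x      y
1      2.8
1.5    3.0
2      2.9
2.5    2.6
3      2.1
3.5    missing
4      1.2
4.5    1.0
5      1.0
5.5    1.3
6      missing
6.5    2.2
7      2.7
7.5    2.9
8      3.0
8.5    2.8
9      2.4
9.5    1.9
10     1.5
10.5   1.1

[Figure: scatter plot of the table. A regression model is estimated from the observed values, $y = \sin x + 2$, and the missing values are estimated with that model. Blue: observed data; red: missing data.]
25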
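The example can be reproduced with ordinary least squares in plain Python. This sketch assumes the model form $y = a \sin x + b$, taken from the fitted curve on the slide; in practice the model form has to be chosen by the analyst. The helper name `fit_line` is hypothetical.

```python
import math
from statistics import mean

# Data from the slide's table; None marks the two missing y values.
xs = [1 + 0.5 * i for i in range(20)]   # 1, 1.5, ..., 10.5
ys = [2.8, 3.0, 2.9, 2.6, 2.1, None, 1.2, 1.0, 1.0, 1.3,
      None, 2.2, 2.7, 2.9, 3.0, 2.8, 2.4, 1.9, 1.5, 1.1]

def fit_line(u, v):
    """Least-squares fit of v = a * u + b for one feature u."""
    mu, mv = mean(u), mean(v)
    a = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / \
        sum((ui - mu) ** 2 for ui in u)
    return a, mv - a * mu

# Fit on the observed rows only, using sin(x) as the single feature.
obs = [(math.sin(x), y) for x, y in zip(xs, ys) if y is not None]
a, b = fit_line([s for s, _ in obs], [y for _, y in obs])

# Fill each gap with the model's prediction.
imputed = [a * math.sin(x) + b if y is None else y for x, y in zip(xs, ys)]
print(round(a, 2), round(b, 2))
print(round(imputed[5], 2), round(imputed[10], 2))   # estimates at x = 3.5 and x = 6
```

The fitted coefficients come out close to a = 1 and b = 2, which matches the model shown on the slide. Note that every imputed point lies exactly on the curve; stochastic regression imputation adds a random residual to avoid that.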
Regression imputation  R code
26
R-code
Source: Templ, Matthias, and Peter Filzmoser. "Visualization of missing values using the R-package VIM." Research report CS-2008-1, Department of Statistics and Probability Theory, Vienna University of Technology (2008).
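K-NN imputation
• Imputation using the K-NN (k-nearest neighbors) algorithm
• Suppose K=6:
- Draw a circle centered on the missing value and expand it until 6 data points fall inside.
- Once 6 data points are inside the circle, the missing class is replaced with the class that most of those points belong to.

X    Y    Class
35   62   a
57   11   a
98   46   b
52   24   a
33   19   a
40   70   missing
28   56   a
21   89   a
94   17   b
10   37   a
73   88   b
97   77   b
37   37   a
95   72
36   9    a
25   93   a

[Figure: scatter plot of the table (x axis 0 to 90, y axis 0 to 100) with the missing value marked "?" and its K=6 neighborhood.]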
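The procedure above amounts to a majority vote among the K nearest labeled points. A minimal sketch on the slide's table, assuming Euclidean distance on the raw, unscaled coordinates and treating the row without a class (95, 72) as unlabeled; the function name `knn_impute_class` is hypothetical.

```python
import math
from collections import Counter

# (x, y, class) rows from the slide; None marks a missing or blank class.
rows = [
    (35, 62, "a"), (57, 11, "a"), (98, 46, "b"), (52, 24, "a"),
    (33, 19, "a"), (40, 70, None), (28, 56, "a"), (21, 89, "a"),
    (94, 17, "b"), (10, 37, "a"), (73, 88, "b"), (97, 77, "b"),
    (37, 37, "a"), (95, 72, None), (36, 9, "a"), (25, 93, "a"),
]

def knn_impute_class(rows, target, k=6):
    """Return the majority class among the k labeled points nearest to target."""
    tx, ty = target
    labeled = [(math.hypot(x - tx, y - ty), c)
               for x, y, c in rows if c is not None]
    labeled.sort(key=lambda pair: pair[0])
    votes = Counter(c for _, c in labeled[:k])
    return votes.most_common(1)[0][0]

imputed_class = knn_impute_class(rows, (40, 70), k=6)
print(imputed_class)   # a
```

For the point (40, 70), five of the six nearest labeled neighbors are class a and one is class b, so the vote gives a. With features on different scales, the coordinates would normally be standardized before computing distances.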
27
K-NN imputation  R code
28
R-code
Source: Templ, Matthias, and Peter Filzmoser. "Visualization of missing values using the R-package VIM." Research report CS-2008-1, Department of Statistics and Probability Theory, Vienna University of Technology (2008).
Interpolation
• Interpolation: a method of constructing new data points within the range of a set of known data points
[Figure panels: Piecewise constant interpolation, Linear interpolation, Spline interpolation]
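Of the three variants, linear interpolation is the simplest to sketch: each gap is filled from the straight line between its nearest known neighbors on either side. The sample points below are a subset of the regression example's table, and the function name `linear_interpolate` is an assumption made for illustration.

```python
def linear_interpolate(xs, ys):
    """Fill None entries in ys by linear interpolation over xs.

    Gaps at the edges (no known neighbor on one side) are left as None.
    """
    filled = list(ys)
    known = [i for i, v in enumerate(ys) if v is not None]
    for i, v in enumerate(ys):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None or right is None:
            continue
        t = (xs[i] - xs[left]) / (xs[right] - xs[left])
        filled[i] = ys[left] + t * (ys[right] - ys[left])
    return filled

xs = [3.0, 3.5, 4.0, 5.5, 6.0, 6.5]
ys = [2.1, None, 1.2, 1.3, None, 2.2]
filled = linear_interpolate(xs, ys)
print([round(v, 2) for v in filled])   # [2.1, 1.65, 1.2, 1.3, 1.75, 2.2]
```

Piecewise constant interpolation would instead copy the nearest known value, and spline interpolation would fit smooth piecewise polynomials through the known points.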
29
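Multiple imputation
• Single imputation is convenient for analyzing data with missing values, but the standard error of the estimator can be underestimated.
• Multiple imputation runs single imputation n times to create n hypothetical complete data sets, then combines their estimates and variances.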
30
Multiple imputation  R code
31
R-code
Source: Templ, Matthias, and Peter Filzmoser. "Visualization of missing values using the R-package VIM." Research report CS-2008-1, Department of Statistics and Probability Theory, Vienna University of Technology (2008).
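Multiple imputation
1. Repeat a single imputation method n times to generate n complete data sets.
2. For each of the n complete data sets, compute the estimate and its variance from the data with the imputed missing values.
3. Combine the estimates and variances of the n data sets using Rubin's rule.
[Figure: Incomplete data → n complete data sets (missing values replaced by estimated values) → estimation per data set → pooling with Rubin's rule]
Rubin's rule
• $\hat{\theta}_i$: estimate obtained from each data set
• $SE_i$: standard error of the estimate
• W: within-imputation variance
• B: between-imputation variance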
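The pooling step can be sketched as follows. The formulas are the standard form of Rubin's rules and are supplied here, since the slide's own equations did not survive in the transcript: the pooled estimate is the mean of the per-data-set estimates, W is the mean of the squared standard errors, B is the sample variance of the estimates, and the total variance is T = W + (1 + 1/n) B. The input numbers are made up.

```python
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool n per-data-set estimates and standard errors with Rubin's rules."""
    n = len(estimates)
    pooled = mean(estimates)                       # pooled point estimate
    w = mean([se ** 2 for se in std_errors])       # within-imputation variance
    b = variance(estimates)                        # between-imputation variance
    total = w + (1 + 1 / n) * b                    # total variance
    return pooled, w, b, total

# Hypothetical estimates from n = 3 imputed data sets.
pooled, w, b, total = pool_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(pooled, w, b, round(total, 4))   # 2.0 1.0 1.0 2.3333
```

The between-imputation term B is what single imputation leaves out: it carries the extra uncertainty that comes from not knowing the missing values.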
32
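Visualization
33
• Visualization supports the investigation of missing data and helps in understanding the characteristics of the missing values.
[Diagram: where visualization fits in]
• Missing type: MCAR, MAR, NMAR
• Missing data → Single imputation, Multiple imputation
• Explicit modeling: Mean, Regression, Stochastic regression
• Implicit modeling: Hot deck, Cold deck, Substitution, Deletion
• Diagram note: models for imputing the data with statistical methods
• Visualization: investigation of the missing pattern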
34
Tools/packages for analyzing missing patterns
• Tool
- Tableau: Interactive data exploration software
• R Package
- VIM: Visualization and imputation of missing values
- Amelia II: Bootstrap EM imputation
35
Tableau
36
VIM (Visualization and Imputation of Missing values) package
• An R package for visualizing missing values and applying imputation models
Main features
• Visualization
- Marginplot
- Matrixplot
- Histogram
• Imputation model
- kNN
- Hotdeck
- Regression
37
VIM Package
• Aggregations for missing/imputed values
- Calculate or plot the amount of missing/imputed values in each variable and the amount of missing/imputed values in certain combinations of variables.
[Figure: aggregation plot; x axis: Variables (both panels); legend: Missing data, Observed data. Annotation: NonD, Dream, and Span are missing together in 1.6% of the cases.]
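The same two aggregations (share of missing values per variable, and counts per combination of missing variables) can be sketched without any plotting. This is plain Python, not the VIM API; the column names are borrowed from the slide, while the values and the function name `missing_summary` are made up for illustration.

```python
from collections import Counter

def missing_summary(rows, columns):
    """Return the missing share per variable and counts per missing combination."""
    n = len(rows)
    per_variable = {c: sum(r[c] is None for r in rows) / n for c in columns}
    combinations = Counter(
        tuple(c for c in columns if r[c] is None) for r in rows
    )
    return per_variable, combinations

columns = ["NonD", "Dream", "Span"]
rows = [  # hypothetical records; None marks a missing value
    {"NonD": 6.3, "Dream": 2.0, "Span": 4.5},
    {"NonD": None, "Dream": None, "Span": None},
    {"NonD": None, "Dream": None, "Span": 14.0},
    {"NonD": 9.1, "Dream": 0.7, "Span": None},
]

per_variable, combinations = missing_summary(rows, columns)
print(per_variable)   # {'NonD': 0.5, 'Dream': 0.5, 'Span': 0.5}
for combo, count in combinations.most_common():
    print(combo or "complete", count)
```

The empty tuple stands for complete rows. Dividing a combination's count by the number of rows gives a share like the 1.6% annotated in the plot.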
38
VIM Package
• Margin plot: Scatterplot with additional information in the margins
[Figure: margin plot; legend: Missing data, Observed data]
39
VIM Package
• Matrix plot
- In a matrix plot, all cells of a data matrix are visualized by rectangles.
- Available data is coded according to a continuous color scheme.
- Missing values can easily be distinguished by using a color such as red/orange.
40
Visualization technique of missing data
Song, Hayeong, and Danielle Albers Szafir. "Where's My Data? Evaluating Visualizations with Missing Data." IEEE transactions on visualization and computer graphics (2018).
41
Contact for inquiries
• Yun Jang (jangy@sejong.edu)
 壱覲 (hbyeon109@gmail.com)
42


[2018 Bigdata win-win conference] 4
