18. Machine Learning (ML)
(c) 2016 SNU Biointelligence Laboratory, http://bi.snu.ac.kr/
Q. If the season is dry and the pavement is slippery, did it rain?
A. Unlikely, it is more likely that the sprinkler was ON
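The question and answer above can be checked by enumeration on a small Bayesian network. The sketch below assumes a network in which a dry season influences both the sprinkler and rain, either of which can wet the pavement, and a wet pavement tends to be slippery. All probability values and names are hypothetical, chosen only to illustrate the reasoning; they do not come from the slide.

```python
from itertools import product

# Hypothetical conditional probabilities, chosen only for illustration.
P_SPRINKLER_GIVEN_DRY = 0.7   # P(sprinkler on | dry season)
P_RAIN_GIVEN_DRY = 0.1        # P(rain | dry season)
# P(pavement wet | sprinkler, rain)
P_WET = {(True, True): 0.99, (True, False): 0.90,
         (False, True): 0.90, (False, False): 0.01}
# P(pavement slippery | wet)
P_SLIPPERY = {True: 0.7, False: 0.05}


def joint(sprinkler, rain, wet):
    """P(sprinkler, rain, wet, slippery=True | season=dry)."""
    p = P_SPRINKLER_GIVEN_DRY if sprinkler else 1 - P_SPRINKLER_GIVEN_DRY
    p *= P_RAIN_GIVEN_DRY if rain else 1 - P_RAIN_GIVEN_DRY
    p_wet = P_WET[(sprinkler, rain)]
    p *= p_wet if wet else 1 - p_wet
    return p * P_SLIPPERY[wet]


def posterior(query):
    """P(query is True | season=dry, slippery=True), summing out the rest."""
    total = 0.0
    match = 0.0
    for sprinkler, rain, wet in product([True, False], repeat=3):
        p = joint(sprinkler, rain, wet)
        total += p
        if {"sprinkler": sprinkler, "rain": rain}[query]:
            match += p
    return match / total


p_rain = posterior("rain")
p_sprinkler = posterior("sprinkler")
print(f"P(rain | dry, slippery)      = {p_rain:.3f}")
print(f"P(sprinkler | dry, slippery) = {p_sprinkler:.3f}")
```

With these assumed numbers the sprinkler is the far more probable explanation, which matches the answer given above. Different probability tables would give different values.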
??? ????
??? ?????
??? ??? ??? ????
????, ?? ?? ???? ??
?? ??
?? ??
??, ????
?? ?? ??
?? ?? ??
?? ?? ?? ??
29. Top 20 Most Popular Tools for Big Data, Data Mining, and Data Science
Source: http://www.kdnuggets.com/2015/06/data-mining-data-science-tools-associations.html
Red: Free/Open Source tools
Green: Commercial tools
Fuchsia: Hadoop/Big Data tools
38. Approaching (Almost) Any Machine Learning Problem
Figure from: A. Thakur and A. Krohn-Grimberghe, AutoCompete: A Framework for Machine Learning Competitions, AutoML Workshop, International Conference on Machine Learning, 2015.
http://blog.kaggle.com/2016/07/21/approaching-almost-any-machine-learning-problem-abhishek-thakur/
39. Example: Pima Indians Diabetes Diagnosis
- Description
  - Pima Indians have the highest prevalence of diabetes in the world
  - We will build classification models that diagnose if the patient shows signs of diabetes
  - http://archive.ics.uci.edu/ml/datasets/Pima+Indians+Diabetes
- Configuration of the data set
  - 768 instances
  - 8 attributes
    - age, number of times pregnant, results of medical tests/analysis
    - all numeric (integer or real-valued)
  - Class label = 1 (positive example)
    - Interpreted as "tested positive for diabetes"
    - 268 instances
  - Class label = 0 (negative example)
    - 500 instances
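The class counts above imply a baseline that any classifier on this data set should beat. A minimal sketch, using only the counts stated on the slide (the variable names are illustrative):

```python
# Class counts taken from the data set description above.
n_positive = 268  # class label 1: tested positive for diabetes
n_negative = 500  # class label 0
n_total = n_positive + n_negative

positive_rate = n_positive / n_total
# Accuracy of a trivial classifier that always predicts the majority class.
baseline_accuracy = max(n_positive, n_negative) / n_total

print(f"instances: {n_total}")
print(f"positive rate: {positive_rate:.3f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
```

Because about 65% of the instances are negative, a reported accuracy near 65% means a model has learned little beyond the class prior.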
40. WEKA? ??? ??? ???
- Descriptive analysis
Part
47. Classification with Weka (MultilayerPerceptron)
- Load a file that contains the training data by clicking the "Open file" button
  - "ARFF" or "CSV" formats are readable
- Click the "Classify" tab
- Click the "Choose" button
- Select "weka - functions - MultilayerPerceptron"
- Click "MultilayerPerceptron"
  - Set parameters for MLP
- Set parameters for Test
- Click "Start" for learning
49. Test Options and Classifier Output
- Test options: setting the data set used for evaluation
- Classifier output: there are various metrics for evaluation
50. Classifier Output
- Run information
- Classifier model (full training set)
- Evaluation results
  - General summary
  - Detailed accuracy by class
  - Confusion matrix
The output depends on the classifier.
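The evaluation results listed above are all derived from the confusion matrix. The sketch below shows that relationship for a two-class problem. The four counts are hypothetical (they are not output from any run in these slides) and were chosen only so that the row totals match a data set with 500 negative and 268 positive instances.

```python
# Hypothetical confusion-matrix counts, for illustration only.
# Rows are actual classes, columns are predicted classes.
tn, fp = 420, 80    # actual negative: predicted negative / predicted positive
fn, tp = 110, 158   # actual positive: predicted negative / predicted positive

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # the "general summary" figure
precision = tp / (tp + fp)     # per-class: how many predicted positives are right
recall = tp / (tp + fn)        # per-class: how many actual positives are found
f_measure = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"F-measure = {f_measure:.3f}")
```

The same formulas apply to the negative class by swapping the roles of the two rows and columns.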
51. Artificial Neural Network (ANN) Exercise
- Main parameter settings for ANN (select functions-MultilayerPerceptron)
  - learningRate -- The amount the weights are updated.
  - momentum -- Momentum applied to the weights during updating.
  - hiddenLayers -- Defines the hidden layers of the neural network as a comma-separated list of positive whole numbers, one for each hidden layer.
    - Ex) 3: one hidden layer with 3 hidden nodes
    - Ex) 5,3: two hidden layers with 5 and 3 hidden nodes, respectively
    - To have no hidden layers, put a single 0 here. This will only be used if autobuild is set. There are also wildcard values: 'a' = (attribs + classes) / 2, 'i' = attribs, 'o' = classes, 't' = attribs + classes.
  - trainingTime -- The number of epochs to train through. If the validation set is non-zero, training can terminate early.
- Experiments
? ?? ??: ?? ????? ??? ???? ?? ???? ?? ???? ??
?? ??
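The hiddenLayers notation described above can be made concrete with a small parser. This is an illustrative sketch, not Weka code: the function name `parse_hidden_layers` is hypothetical, and it assumes the 'a' wildcard uses integer division. The attribute count Weka sees can also differ from the raw count if preprocessing filters change it.

```python
def parse_hidden_layers(spec, n_attribs, n_classes):
    """Expand a hiddenLayers string into a list of layer sizes.

    Hypothetical helper for illustration; assumes integer division for 'a'.
    """
    wildcards = {
        "a": (n_attribs + n_classes) // 2,
        "i": n_attribs,
        "o": n_classes,
        "t": n_attribs + n_classes,
    }
    sizes = []
    for token in spec.split(","):
        token = token.strip()
        sizes.append(wildcards[token] if token in wildcards else int(token))
    # A single 0 means "no hidden layers".
    if sizes == [0]:
        return []
    return sizes


# Example: a data set with 8 attributes and 2 classes.
for spec in ["3", "5,3", "0", "a", "t"]:
    print(spec, "->", parse_hidden_layers(spec, n_attribs=8, n_classes=2))
```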
52. SVM Exercise
- Main parameter settings for SVM (select functions-SMO)
  - c -- The complexity parameter C.
  - kernel -- The kernel to use.
    - PolyKernel -- The polynomial kernel: K(x, y) = <x, y>^p or K(x, y) = (<x, y> + 1)^p.
      - "exponent" represents p in the equations.
    - RBFKernel -- K(x, y) = exp(-gamma * <x - y, x - y>)
      - gamma (γ) controls the width (range of neighborhood) of the kernel
- Experiments
? ?? ??: ??? ?? ??. ??? ?? ???? ?? ??
  - PolyKernel: test several exponents, {1, 2, 5}
  - RBF kernel: "grid search" on C and γ using cross-validation
    - C = {0.1, 1, 10}, γ = {0.1, 1, 10}
- Reference
  - A practical guide to SVM classification (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf)
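The two kernels and the search grid from this slide can be sketched in plain Python. This is an illustration rather than Weka's implementation: the function names and the `add_one` flag are hypothetical, and the RBF kernel is written in its standard form exp(-gamma * ||x - y||^2), where ||x - y||^2 equals <x - y, x - y>.

```python
import math
from itertools import product


def dot(x, y):
    """Inner product <x, y> of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, exponent=1.0, add_one=False):
    """<x, y>^p, or (<x, y> + 1)^p when add_one is True."""
    base = dot(x, y) + (1.0 if add_one else 0.0)
    return base ** exponent


def rbf_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2); a larger gamma gives a narrower kernel."""
    diff = [a - b for a, b in zip(x, y)]
    return math.exp(-gamma * dot(diff, diff))


# The grid from the slide: every (C, gamma) pair is one candidate setting,
# each of which would be scored by cross-validation.
C_VALUES = [0.1, 1, 10]
GAMMA_VALUES = [0.1, 1, 10]
grid = list(product(C_VALUES, GAMMA_VALUES))

x, y = [1.0, 2.0], [3.0, 4.0]
print("poly, p=2:", poly_kernel(x, y, exponent=2))
for gamma in GAMMA_VALUES:
    print(f"rbf, gamma={gamma}:", rbf_kernel(x, y, gamma))
print("candidate settings:", len(grid))
```

The grid grows multiplicatively with each added parameter value, which is why the slide keeps it to three values per parameter.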
54. Using Experimenter in Weka
- Tool for "batch" experiments
- Click "New"
- Set experiment type / iteration control
- Set datasets / algorithms
- Select the "Run" tab and click "Start"
- If it has finished successfully, click the "Analyse" tab and see the summary
55. Usages of the Experimenter
- Model selection for classification/regression
  - Various approaches
    - Repeated training/test set split
    - Repeated cross-validation (cf. double cross-validation)
    - Averaging
  - Comparison between models / algorithms
    - Paired t-test
    - On various metrics: accuracies / RMSE / etc.
- Batch and/or distributed processing
  - Load/save experiment settings
  - http://weka.wikispaces.com/Remote+Experiment
  - Multi-core support: utilize all the cores on a multi-core machine
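The paired t-test mentioned above compares two algorithms on the same repeated runs. A minimal sketch of the plain paired t statistic follows. The accuracy values are hypothetical, and the Experimenter may apply a corrected variant of the test, so this shows the idea and is not meant to reproduce Weka's numbers.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees of freedom) for paired per-run scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both algorithms must be scored on the same runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Hypothetical accuracies of two algorithms on the same five runs.
acc_a = [0.80, 0.82, 0.78, 0.81, 0.79]
acc_b = [0.75, 0.78, 0.74, 0.77, 0.76]

t, dof = paired_t_statistic(acc_a, acc_b)
print(f"t = {t:.3f} with {dof} degrees of freedom")
```

The statistic is then compared against the t distribution with the given degrees of freedom to decide whether the difference is significant.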
67. What is Data?
- Definition of "data"
(c) 2008-2015, B.-H. Kim
"Data is a set of values of qualitative or quantitative variables, belonging to a set of items."
Variables: A measurement or characteristic of an item.
Qualitative: Country of origin, gender, treatment
Quantitative: Height, weight, blood pressure
68. What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of "interesting" patterns or knowledge from a huge amount of data
  - "Interesting" means: non-trivial, implicit, previously unknown, and potentially useful
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: Is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems
Slides from the lecture slides of Ch. 1 by J. Han, et al., for Data Mining: Concepts and Techniques
69. Related Research Fields
- Data Mining
- Artificial Intelligence (AI)
- Machine Learning (ML)
- Deep Learning
- Data Science
- Information Retrieval (IR)
- Knowledge Discovery from Data (KDD)
- Big Data Analytics
- Business Intelligence
70. Machine Learning & Data Mining
Slide from the GECCO 2009 Tutorial on "Large Scale Data Mining using Genetics-Based Machine Learning", by Jaume Bacardit and Xavier Llorà
72. Data Science
Figure source: http://nirvacana.com/thoughts/becoming-a-data-scientist/
73. Types of Data Analysis by Type of Question
- Descriptive
  - Describe a set of data
- Exploratory
  - Find relationships you didn't know about
  - Correlation does not imply causation
- Inferential
  - Use a relatively small sample of data to say something about a bigger population
  - Inference is commonly the goal of statistical models
- Predictive
  - Use the data on some objects to predict values for another object
  - Accurate prediction depends heavily on measuring the right variables
- Causal
  - Find out what happens to one variable when you make another variable change
- Mechanistic
  - Understand the exact changes in variables that lead to changes in other variables for individual objects
J. Leek, Data Analysis: Structure of a Data Analysis, lecture at Coursera, 2013
74. Data Sets Required for Each Type of Analysis
- Descriptive
  - A whole population
- Exploratory
  - A random sample with many variables measured
- Inferential
  - The right population, randomly sampled
- Predictive
  - A training and test data set from the same population
- Causal
  - Data from a randomized study
- Mechanistic
  - Data about all components of the system
J. Leek, Data Analysis: Structure of a Data Analysis, lecture at Coursera, 2013
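For the predictive case, the requirement of a training and a test set from the same population is usually met by randomly splitting one data set. A minimal sketch, assuming a 70/30 split and a fixed seed for repeatability (the function name and defaults are illustrative, not from the lecture):

```python
import random


def train_test_split(instances, test_fraction=0.3, seed=0):
    """Shuffle a copy of the instances and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in data: 768 instance identifiers.
data = list(range(768))
train, test = train_test_split(data)
print(len(train), "training instances,", len(test), "test instances")
```

Because both parts are drawn at random from the same pool, neither is systematically different from the other, which is what makes the test-set error a fair estimate.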
75. Steps in a Data Analysis
- Define the question
- Dataset
  - Define the ideal data set
  - Determine what data you can access
  - Obtain the data
  - Clean the data
- Exploratory data analysis
  - Clustering / data visualization
- Statistical prediction/modeling
  - Classification / prediction
- Interpret results
  - Evaluation and model selection
- Challenge results
- Synthesize/write up results
- Create reproducible code
J. Leek, Data Analysis: Structure of a Data Analysis, lecture at Coursera, 2013
76. Raw versus processed data
- Raw data
  - The original source of the data
  - Often hard to use for data analyses
  - Data analysis includes processing
  - Raw data may only need to be processed once
- Processed data
  - Data that is ready for analysis
  - Processing can include merging, subsetting, transforming, etc.
  - There may be standards for processing
  - All steps should be recorded