Introduction to Machine Learning with Apache Spark!
Spark Meetup, 12.03.2015, Marko Velić, PhD
Lecturer
• 2014 - PhD in machine learning, Faculty of Organisation and Informatics, Varazdin, UNIZG
• A dozen papers, projects, and two pending patents in machine learning
• Work experience:
  • 2015 - Data Lab: consulting, "Data Science" and machine learning for some of the biggest companies (both Croatian and global)
  • Currently establishing a Big Data department at Styria group
  • 2013-2015 - University Computing Centre, head of the data analysis department
  • 2007-2013 - CEO of a small development company
  • Since 2011 - lecturer at Algebra University (C++, ML, etc.)
• Interests: artificial intelligence, machine learning, computer vision, deep learning
Survey - Your experience with ML?
• Used/developed in commercial projects
• Used/developed in academia
• Trying it out on my own
• Have never used it
• Have never heard of it
How do they do it?
Content
• What is AI?
• What is ML?
• Learning types
• Variable types
• Spark MLlib and ML
• Naive Bayes
• Model testing
• Demo
• Where to learn ML? What's next?
What is AI?
[Diagram labels: AI; Heuristics; Rules + Logic; Fuzzy Logic; Machine Learning]
What is ML?
[Diagram labels: Information Theory; Statistics, Probability, Mathematics; Software Engineering]
Learning types
• Supervised
  • Class is known
  • Learning from experience
• Unsupervised
  • Class is unknown
  • Grouping (searching for) similar points
Terminology
Synonyms in Croatian                          | Synonyms in English
Opservacija, podatak                          | Observation, Data instance, Example, Data Sample, Point
Klasa, zavisna varijabla, ciljna varijabla    | Class, Dependent variable, Goal, Outcome
Varijabla, značajka, atribut, nezavisna var.  | Variable, Feature, Attribute, Independent var.
Prenaučenost, pretreniranost modela           | Model Overfitting
Kontinuirane, kvantitativne varijable         | Continuous, Numeric, Quantitative
Diskretne, kvalitativne varijable             | Discrete, Qualitative
Klasifikacija, raspoznavanje, razvrstavanje   | Classification
Grupiranje, klasteriranje                     | Clustering
Anotirani, označeni podaci                    | Annotated, Labelled Dataset (Points)
Data/Variable Types
Type        | Subtype  | Possible operations
Discrete    | Nominal  | = , <>
Discrete    | Ordinal  | > , < , >= , <=
Continuous  | Interval | + , -
Continuous  | Ratio    | * , /
Why is this important?
• Descriptive statistics
• Preprocessing techniques
• Choosing the ML method/algorithm
• Testing methodologies
• Results interpretation
More on this: https://www.youtube.com/watch?v=YFC2KUmEebc (David Mease, Google Tech Talks 2007)
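To make the link between variable type and descriptive statistics concrete, here is a minimal Python sketch. It assumes that each measurement level keeps the operations of the levels before it, and that the mode, median and mean are reasonable summaries for nominal, ordinal and interval/ratio data respectively. The names `allowed_operations` and `central_tendency` are hypothetical helpers and not part of Spark.

```python
from statistics import mean, median_low, mode

# Measurement levels ordered from least to most structured.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Operations that each level adds on top of the previous ones.
NEW_OPERATIONS = {
    "nominal": ["=", "<>"],
    "ordinal": [">", "<", ">=", "<="],
    "interval": ["+", "-"],
    "ratio": ["*", "/"],
}


def allowed_operations(level):
    """Return every operation that is meaningful at the given level."""
    ops = []
    for name in LEVELS[: LEVELS.index(level) + 1]:
        ops.extend(NEW_OPERATIONS[name])
    return ops


def central_tendency(values, level):
    """Pick a summary statistic that only relies on operations valid for the level."""
    if level == "nominal":
        return mode(values)        # needs equality only
    if level == "ordinal":
        return median_low(values)  # needs ordering only, returns an observed value
    return mean(values)            # needs addition and division by a count


if __name__ == "__main__":
    print(allowed_operations("ordinal"))
    print(central_tendency(["Yes", "No", "Yes"], "nominal"))
    print(central_tendency([0, 2, 1, 1, 2], "ordinal"))  # e.g. headache coded 0/1/2
    print(central_tendency([36.6, 38.2, 39.1], "interval"))
```

The same distinction drives preprocessing: a nominal column is usually encoded as categories, while a ratio column can be scaled directly.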
Spark
• MLlib
  • Longer development history
  • Lots of developers and methods
  • Well tested
• ML
  • New
  • Should make ML in Spark easier
  • Support for the entire ML "pipeline"
  • Alpha
  • Bugs?
Spark - ML methods (MLlib)
• Data types
• Basic statistics
  • summary statistics
  • correlations
  • stratified sampling
  • hypothesis testing
  • random data generation
• Classification and regression
  • linear models (SVMs, logistic regression, linear regression)
  • naive Bayes
  • decision trees
  • ensembles of trees (Random Forests and Gradient-Boosted Trees)
• Collaborative filtering
  • alternating least squares (ALS)
• Clustering
  • k-means
• Dimensionality reduction
  • singular value decomposition (SVD)
  • principal component analysis (PCA)
• Feature extraction and transformation
• Optimization (developer)
  • stochastic gradient descent
  • limited-memory BFGS (L-BFGS)
Naive Bayes
Chills | Runny Nose | Headache | Fever | Flu?
Yes    | No         | Moderate | Yes   | No
Yes    | Yes        | No       | No    | Yes
Yes    | No         | Strong   | Yes   | Yes
No     | Yes        | Moderate | Yes   | Yes
No     | No         | No       | No    | No
No     | Yes        | Strong   | Yes   | Yes
No     | Yes        | Strong   | No    | No
Yes    | Yes        | Moderate | Yes   | Yes
• What about the next patient? Symptoms:
Yes    | No         | Moderate | No    | ?
Calculation 1/2
Condition                    | Probability | Condition                   | Probability
P(Flu=Yes)                   | 0.625       | P(Flu=No)                   | 0.375
P(Chills=Yes|Flu=Yes)        | 0.6         | P(Chills=Yes|Flu=No)        | 0.333
P(Chills=No|Flu=Yes)         | 0.4         | P(Chills=No|Flu=No)         | 0.667
P(Runny Nose=Yes|Flu=Yes)    | 0.8         | P(Runny Nose=Yes|Flu=No)    | 0.333
P(Runny Nose=No|Flu=Yes)     | 0.2         | P(Runny Nose=No|Flu=No)     | 0.667
P(Headache=Moderate|Flu=Yes) | 0.4         | P(Headache=Moderate|Flu=No) | 0.333
P(Headache=No|Flu=Yes)       | 0.2         | P(Headache=No|Flu=No)       | 0.333
P(Headache=Strong|Flu=Yes)   | 0.4         | P(Headache=Strong|Flu=No)   | 0.333
P(Fever=Yes|Flu=Yes)         | 0.8         | P(Fever=Yes|Flu=No)         | 0.333
P(Fever=No|Flu=Yes)          | 0.2         | P(Fever=No|Flu=No)          | 0.667
$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}$$
Calculation 2/2
• For the patient:
Chills | Runny Nose | Headache | Fever | Flu?
Yes    | No         | Moderate | No    | ?
• Just multiply:
  • P(Flu=Yes) · P(Chills=Yes|Flu=Yes) · P(Runny Nose=No|Flu=Yes) · P(Headache=Moderate|Flu=Yes) · P(Fever=No|Flu=Yes) = ?
  • P(Flu=No) · P(Chills=Yes|Flu=No) · P(Runny Nose=No|Flu=No) · P(Headache=Moderate|Flu=No) · P(Fever=No|Flu=No) = ?
Example source: https://www.youtube.com/watch?v=ZAfarappAO0
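The two products can be checked with a few lines of plain Python. The sketch below is a minimal, unsmoothed Naive Bayes built by counting over the eight training rows of the flu table; it is not Spark's implementation, and library implementations usually add smoothing so that unseen values do not zero out a product. The function names `train`, `scores` and `posteriors` are illustrative.

```python
from collections import Counter, defaultdict

# Columns: Chills, Runny Nose, Headache, Fever, Flu? (the eight labelled rows)
ROWS = [
    ("Yes", "No", "Moderate", "Yes", "No"),
    ("Yes", "Yes", "No", "No", "Yes"),
    ("Yes", "No", "Strong", "Yes", "Yes"),
    ("No", "Yes", "Moderate", "Yes", "Yes"),
    ("No", "No", "No", "No", "No"),
    ("No", "Yes", "Strong", "Yes", "Yes"),
    ("No", "Yes", "Strong", "No", "No"),
    ("Yes", "Yes", "Moderate", "Yes", "Yes"),
]


def train(rows):
    """Count class frequencies and per-class feature value frequencies."""
    class_counts = Counter(row[-1] for row in rows)
    cond_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for row in rows:
        label = row[-1]
        for i, value in enumerate(row[:-1]):
            cond_counts[(i, label)][value] += 1
    return class_counts, cond_counts


def scores(patient, class_counts, cond_counts):
    """Return P(class) * product of P(feature=value | class) for each class."""
    total = sum(class_counts.values())
    result = {}
    for label, n in class_counts.items():
        score = n / total
        for i, value in enumerate(patient):
            score *= cond_counts[(i, label)][value] / n
        result[label] = score
    return result


def posteriors(raw):
    """Divide by the evidence P(E) so the class scores sum to 1."""
    evidence = sum(raw.values())
    return {label: s / evidence for label, s in raw.items()}


class_counts, cond_counts = train(ROWS)
raw = scores(("Yes", "No", "Moderate", "No"), class_counts, cond_counts)
post = posteriors(raw)
prediction = max(post, key=post.get)

if __name__ == "__main__":
    print(raw)
    print(post)
    print("Predicted Flu?:", prediction)
```

With these counts the Flu=Yes product is about 0.006 and the Flu=No product about 0.0185, so the model predicts that the new patient does not have the flu (posterior of roughly 0.76 for "No").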
Model testing - confusion matrix and error types
                    | Predicted Positive (P') | Predicted Negative (N')
Actual Positive (P) | True Positive (TP)      | False Negative (FN)
Actual Negative (N) | False Positive (FP)     | True Negative (TN)
Model testing - success/accuracy measures
• Classification Accuracy
  • (TP+TN)/(TP+TN+FP+FN)
• Sensitivity
  • TP/P = TP/(TP+FN)
• Specificity
  • TN/N = TN/(TN+FP)
• Positive Predictive Value (PPV)
  • TP/P' = TP/(TP+FP)
• Negative Predictive Value (NPV)
  • TN/N' = TN/(TN+FN)
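The measures above follow directly from the four confusion-matrix counts. Here is a small Python sketch; the label lists are made-up illustration data rather than results from the talk, and `confusion_counts` and `metrics` are hypothetical helper names.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, TN, FN) for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def safe_div(num, den):
    """Avoid ZeroDivisionError when a row or column of the matrix is empty."""
    return num / den if den else float("nan")


def metrics(tp, fp, tn, fn):
    """Compute the measures listed on the slide from the four counts."""
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "ppv": safe_div(tp, tp + fp),
        "npv": safe_div(tn, tn + fn),
    }


# Made-up labels, for illustration only.
actual = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "Yes"]
predicted = ["Yes", "No", "Yes", "No", "Yes", "No", "No", "Yes"]

counts = confusion_counts(actual, predicted, positive="Yes")
report = metrics(*counts)

if __name__ == "__main__":
    print(counts)
    print(report)
```

Accuracy alone can mislead on imbalanced classes, which is why sensitivity and specificity are reported alongside it.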
Why ML in Spark?
• MLlib (and ML) are built on Spark
• Speed comes from Spark (distributed learning, in-memory processing, fault tolerance, etc.)
• Lots of algorithms
• API is simple to use
• Various languages (Scala, Java, Python)
• Open source community (very active)
• Simple integration with other Spark components, e.g. Spark Streaming and "online" learning
• Spark ecosystem for the entire "pipeline"
Source: "MLlib: Spark's Machine Learning Library" by Ameet Talwalkar at AMPCamp 5 - http://www.slideshare.net/jeykottalam/mllib
Features
• Always starting with a "table"
  • Rows are data points
  • Columns are variables/features
• Dense - all fields are filled
• Sparse - only "non-zero" data
• Feature hashing
  • John likes to watch movies.
  • Mary likes movies too.
  • John also likes football.
"John likes to watch movies. Mary likes too. John also likes to watch football games."
Dictionary: {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
Matrix: [[1 2 1 1 1 0 0 0 1 1] [1 1 1 1 0 1 1 1 0 0]]
Sources: http://en.wikipedia.org/wiki/Feature_hashing and http://stats.stackexchange.com/questions/73325/understanding-feature-hashing
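The dictionary and matrix above can be reproduced with a short Python sketch. It assumes the quoted text is split into two documents ("John likes to watch movies. Mary likes too." and "John also likes to watch football games."), which is what the two matrix rows suggest. The second function illustrates the hashing trick itself, where a hash function replaces the dictionary; using `zlib.crc32` and 16 buckets is an arbitrary choice for this example, not what Spark uses.

```python
import re
import zlib

DICTIONARY = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
              "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

# Assumed split of the quoted text into two documents (one per matrix row).
DOCUMENTS = [
    "John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.",
]


def tokenize(text):
    """Split text into alphabetic tokens, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(text, dictionary):
    """Count tokens at the position given by the dictionary (1-based)."""
    vector = [0] * len(dictionary)
    for token in tokenize(text):
        if token in dictionary:
            vector[dictionary[token] - 1] += 1
    return vector


def hashed_features(text, num_buckets):
    """Hashing trick: the bucket index comes from a hash, so no dictionary is stored."""
    vector = [0] * num_buckets
    for token in tokenize(text):
        bucket = zlib.crc32(token.encode("utf-8")) % num_buckets
        vector[bucket] += 1  # different tokens may collide in the same bucket
    return vector


matrix = [bag_of_words(doc, DICTIONARY) for doc in DOCUMENTS]
hashed = [hashed_features(doc, 16) for doc in DOCUMENTS]

if __name__ == "__main__":
    print(matrix)
    print(hashed)
```

The dictionary version needs a pass over the whole corpus to build the vocabulary, while the hashed version has a fixed width chosen up front, at the cost of occasional collisions.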
Spark Demo - Sentiment Analysis
• Annotated dataset of business news in Croatian
  • Source: icapital.hr
• Small dataset (500)
  • We do not expect spectacular results
• Three classes
  • Positive
  • Negative
  • Neutral?
Natural Language Processing / Text Mining
• Preprocessing
  • Stemming
  • Lemmatization
• Features
  • Bag of Words, n-grams
  • TF(t) (Term Frequency) = Occurrences of term t in document / Total number of terms in document
  • IDF(t) (Inverse Document Frequency) = log(Total number of documents / Documents containing t)
  • Linguistic variables...
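The TF and IDF definitions above translate almost literally into Python. The sketch below assumes documents are already tokenized and uses the natural logarithm, since the slide does not state a base. The tiny corpus is invented for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_frequency(term, document):
    """Occurrences of the term divided by the number of terms in the document."""
    return Counter(document)[term] / len(document)


def inverse_document_frequency(term, corpus):
    """log(total number of documents / documents containing the term)."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)


def tf_idf(term, document, corpus):
    """Weight of a term in one document relative to the whole corpus."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


# Invented, pre-tokenized corpus for illustration only.
CORPUS = [
    ["profit", "rises", "profit"],
    ["profit", "falls"],
    ["market", "steady"],
]

if __name__ == "__main__":
    for word in ("profit", "rises"):
        print(word, tf_idf(word, CORPUS[0], CORPUS))
```

A term that appears in every document gets an IDF of log(1) = 0, so very common words contribute nothing to the feature vector.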
NLP in Croatia
• FFZG
  • Free components
  • http://nlp.ffzg.hr
• FER
  • Text Mining Add-On for Orange
  • https://bitbucket.org/biolab/orange-text/src
• FOI - www.foi.hr
• Someone else?
Typical ML/NLP workflow (Orange)
Most of this can already be done in Spark, and soon all of it (ML "Pipelines")...
Where to learn ML?
• Croatian universities
  • FER, FOI, PMF, Algebra, FFZG for NLP, etc.
• By yourself - Internet
  • Papers, books, blogs
  • MOOCs (Coursera, edX, etc.)
  • Famous https://www.coursera.org/course/ml
• Prerequisites (besides programming):
  • https://www.khanacademy.org/math/differential-calculus
  • https://www.khanacademy.org/math/linear-algebra
  • https://www.khanacademy.org/math/probability
  • https://www.coursera.org/course/matrix
  • https://www.coursera.org/learn/calculus1
• Great resource for Spark: http://ampcamp.berkeley.edu/
Next lectures?
• Entropy and variable importance?
• Methods
  • Linear regression and optimization (Gradient descent)
  • Logistic regression
  • Decision trees (Random Forests)
  • Unsupervised learning
  • Collaborative filtering
  • Neural networks (not in Spark, for now)
  • ...
• Model testing (sampling, measures, ROC curve...)
• ML tips & tricks (regularization, overfitting, etc.)
• ...
Content
• What is AI?
• What is ML?
• Learning types
• Variable types
• Spark MLlib and ML
• Naive Bayes
• Model testing
• Demo
• Where to learn ML? What's next?
