
Introduction to Machine Learning with Apache Spark!
Spark Meetup, 12.03.2015, Marko Velić, PhD

Lecturer
• 2014 - PhD in machine learning, Faculty of Organisation and Informatics, Varazdin, UNIZG
• A dozen papers, projects and two patents pending in machine learning
• Work experience:
  • 2015 - Data Lab: consulting, "Data Science" and machine learning for some of the biggest companies (both Croatian and global)
  • Currently establishing the Big Data department at Styria group
  • 2013-2015 - University Computing Centre, head of the data analysis department
  • 2007-2013 - CEO of a small development company
  • Since 2011 - lecturer at Algebra University (C++, ML, etc.)
• Interests: artificial intelligence, machine learning, computer vision, deep learning

Survey - Your experience with ML?
• Used/developed in commercial projects
• Used/developed in academia
• Trying it out on my own
• Never used it
• Never heard of it

Content
• What is AI?
• What is ML?
• Learning types
• Variable types
• Spark MLlib and ML
• Naive Bayes
• Model testing
• Demo
• Where to learn ML? What's next?

What is AI?
[Diagram labels: AI; Heuristics; Rules + Logic; Fuzzy Logic; Machine Learning]

What is ML?
[Diagram labels: Information Theory; Statistics, Probability, Mathematics; Software Engineering]

Learning types
• Supervised
  • Class is known
  • Learning from experience
• Unsupervised
  • Class is unknown
  • Grouping (searching for) similar points

Terminology
Synonyms in Croatian                         | Synonyms in English
Opservacija, podatak                         | Observation, Data instance, Example, Data Sample, Point
Klasa, zavisna varijabla, ciljna varijabla   | Class, Dependent variable, Goal, Outcome
Varijabla, značajka, atribut, nezavisna var. | Variable, Feature, Attribute, Independent var.
Prenaučenost, pretreniranost modela          | Model Overfitting
Kontinuirane, kvantitativne varijable        | Continuous, Numeric, Quantitative
Diskretne, kvalitativne varijable            | Discrete, Qualitative
Klasifikacija, raspoznavanje, razvrstavanje  | Classification
Grupiranje, klasteriranje                    | Clustering
Anotirani, označeni podaci                   | Annotated, Labelled Dataset (Points)

Data/Variable Types
Group      | Type     | Possible operations
Discrete   | Nominal  | =, <>
Discrete   | Ordinal  | >, <, >=, <=
Continuous | Interval | +, -
Continuous | Ratio    | *, /
Why is this important?
• Descriptive statistics
• Preprocessing techniques
• Choosing the ML method/algorithm
• Testing methodologies
• Results interpretation
More on this: https://www.youtube.com/watch?v=YFC2KUmEebc (David Mease, Google Tech Talks 2007)

Spark
• MLlib
  • Longer development
  • Lots of developers and methods
  • Tested well
• ML
  • New
  • Should make ML in Spark easier
  • Support for the entire ML "pipeline"
  • Alpha
  • Bugs?

Spark - ML methods (MLlib)
• Data types
• Basic statistics
  • summary statistics
  • correlations
  • stratified sampling
  • hypothesis testing
  • random data generation
• Classification and regression
  • linear models (SVMs, logistic regression, linear regression)
  • naive Bayes
  • decision trees
  • ensembles of trees (Random Forests and Gradient-Boosted Trees)
• Collaborative filtering
  • alternating least squares (ALS)
• Clustering
  • k-means
• Dimensionality reduction
  • singular value decomposition (SVD)
  • principal component analysis (PCA)
• Feature extraction and transformation
• Optimization (developer)
  • stochastic gradient descent
  • limited-memory BFGS (L-BFGS)

Naive Bayes
Chills | Runny Nose | Headache | Fever | Flu?
Yes    | No         | Moderate | Yes   | No
Yes    | Yes        | No       | No    | Yes
Yes    | No         | Strong   | Yes   | Yes
No     | Yes        | Moderate | Yes   | Yes
No     | No         | No       | No    | No
No     | Yes        | Strong   | Yes   | Yes
No     | Yes        | Strong   | No    | No
Yes    | Yes        | Moderate | Yes   | Yes
• What about the next patient? Symptoms:
Yes    | No         | Moderate | No    | ?

Calculation 2/2
• For the patient:
Chills | Runny Nose | Headache | Fever | Flu?
Yes    | No         | Moderate | No    | ?
• Just multiply:
  • P(Flu=Yes) P(Chills=Yes|Flu=Yes) P(Runny Nose=No|Flu=Yes) P(Headache=Moderate|Flu=Yes) P(Fever=No|Flu=Yes) = ?
  • P(Flu=No) P(Chills=Yes|Flu=No) P(Runny Nose=No|Flu=No) P(Headache=Moderate|Flu=No) P(Fever=No|Flu=No) = ?
Example source: https://www.youtube.com/watch?v=ZAfarappAO0
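The multiplication above can be checked with a short plain-Python sketch. It uses the eight labelled patients from the Naive Bayes table and estimates every probability as a simple relative frequency, with no smoothing. The function and variable names are illustrative assumptions and are not part of Spark MLlib.

```python
from fractions import Fraction

# Training rows from the slide: (Chills, Runny Nose, Headache, Fever) -> Flu?
TRAINING = [
    (("Yes", "No",  "Moderate", "Yes"), "No"),
    (("Yes", "Yes", "No",       "No"),  "Yes"),
    (("Yes", "No",  "Strong",   "Yes"), "Yes"),
    (("No",  "Yes", "Moderate", "Yes"), "Yes"),
    (("No",  "No",  "No",       "No"),  "No"),
    (("No",  "Yes", "Strong",   "Yes"), "Yes"),
    (("No",  "Yes", "Strong",   "No"),  "No"),
    (("Yes", "Yes", "Moderate", "Yes"), "Yes"),
]

# The next patient: Chills=Yes, Runny Nose=No, Headache=Moderate, Fever=No
QUERY = ("Yes", "No", "Moderate", "No")


def naive_bayes_scores(training, query):
    """Return the unnormalized score P(class) * prod P(feature|class) per class."""
    scores = {}
    classes = sorted({label for _, label in training})
    for cls in classes:
        rows = [features for features, label in training if label == cls]
        # Prior: share of training rows that belong to this class
        score = Fraction(len(rows), len(training))
        for i, value in enumerate(query):
            # Conditional: share of this class's rows with the queried value
            matches = sum(1 for row in rows if row[i] == value)
            score *= Fraction(matches, len(rows))
        scores[cls] = score
    return scores


scores = naive_bayes_scores(TRAINING, QUERY)
prediction = max(scores, key=scores.get)

# Normalizing the two scores turns them into posterior probabilities
total = sum(scores.values())
posterior = {cls: score / total for cls, score in scores.items()}

print(scores)      # Yes: 3/500, No: 1/54
print(prediction)  # No
```

With this table the score for Flu=No (1/54) is larger than the score for Flu=Yes (3/500), so the sketch predicts no flu; after normalization that is roughly a 76% posterior for "No". Exact fractions are used only to make the arithmetic easy to follow by hand.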

Model testing - confusion matrix and error types
                    | Predicted Positive (P') | Predicted Negative (N')
Actual Positive (P) | True Positive (TP)      | False Negative (FN)
Actual Negative (N) | False Positive (FP)     | True Negative (TN)

Model testing - success/accuracy measures
• Classification Accuracy
  • (TP+TN)/(TP+TN+FP+FN)
• Sensitivity
  • TP/P = TP/(TP+FN)
• Specificity
  • TN/N = TN/(TN+FP)
• Positive Predictive Value (PPV)
  • TP/P' = TP/(TP+FP)
• Negative Predictive Value (NPV)
  • TN/N' = TN/(TN+FN)
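As a minimal sketch, the five measures can be computed directly from the four confusion-matrix cells. The label lists below are hypothetical and exist only to exercise the formulas; the function names are illustrative, not a Spark API.

```python
from fractions import Fraction


def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, FN, FP, TN for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # actual positive, predicted positive
        elif a == positive:
            fn += 1          # actual positive, predicted negative
        elif p == positive:
            fp += 1          # actual negative, predicted positive
        else:
            tn += 1          # actual negative, predicted negative
    return tp, fn, fp, tn


def measures(tp, fn, fp, tn):
    """Apply the slide's formulas; assumes no denominator is zero."""
    return {
        "accuracy":    Fraction(tp + tn, tp + tn + fp + fn),
        "sensitivity": Fraction(tp, tp + fn),   # TP / P
        "specificity": Fraction(tn, tn + fp),   # TN / N
        "ppv":         Fraction(tp, tp + fp),   # TP / P'
        "npv":         Fraction(tn, tn + fn),   # TN / N'
    }


# Hypothetical test set: 4 actual positives, 6 actual negatives
actual    = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No"]

tp, fn, fp, tn = confusion_counts(actual, predicted)
result = measures(tp, fn, fp, tn)
for name, value in result.items():
    print(f"{name}: {value} = {float(value):.3f}")
```

On this made-up data accuracy is 4/5 while sensitivity is only 3/4, which illustrates why a single accuracy figure can hide how the errors are distributed between the two classes.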

Why ML in Spark?
• MLlib (and ML) are based on Spark
• Speed comes from Spark (distributed learning, in-memory processing, fault tolerance, etc.)
• Lots of algorithms
• API is simple to use
• Various languages (Scala, Java, Python)
• Open source community (very active)
• Simple integration with other Spark components, e.g. Spark Streaming and "online" learning
• Spark ecosystem for the entire "pipeline"

Source: "MLlib: Spark's Machine Learning Library" by Ameet Talwalkar at
AMPCamp 5 - http://www.slideshare.net/jeykottalam/mllib

Features
• Always starting with a "table"
  • Rows are data points
  • Columns are variables/features
• Dense - all fields are filled
• Sparse - only "non-zero" data
• Feature hashing
"John likes to watch movies."
"Mary likes movies too."
"John also likes football."
"John likes to watch movies. Mary likes too."
"John also likes to watch football games."
Dictionary: {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5, "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}
Matrix: [[1 2 1 1 1 0 0 0 1 1] [1 1 1 1 0 1 1 1 0 0]]
Sources: http://en.wikipedia.org/wiki/Feature_hashing and http://stats.stackexchange.com/questions/73325/understanding-feature-hashing
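The dictionary and matrix above can be reproduced with a few lines of plain Python, and the same tokens can then be hashed into a fixed number of buckets instead of being looked up in a dictionary. This is only a sketch: the tokenizer, the choice of `zlib.crc32` as a stable hash, and the bucket count of 8 are assumptions made for illustration, and are not necessarily what Spark uses.

```python
import re
import zlib

# Dictionary from the slide: term -> 1-based column index
DICTIONARY = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
              "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}

# The two documents that correspond to the two matrix rows
DOCS = ["John likes to watch movies. Mary likes too.",
        "John also likes to watch football games."]


def tokenize(text):
    """Split on anything that is not a letter (a deliberately simple tokenizer)."""
    return re.findall(r"[A-Za-z]+", text)


def bag_of_words(text, dictionary):
    """Dense count vector: one column per dictionary term."""
    vec = [0] * len(dictionary)
    for token in tokenize(text):
        if token in dictionary:
            vec[dictionary[token] - 1] += 1
    return vec


def hashed_features(text, n_buckets=8):
    """Hashing trick: the column is hash(token) mod n_buckets, no dictionary needed.
    Different tokens may collide in the same bucket."""
    vec = [0] * n_buckets
    for token in tokenize(text):
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        vec[index] += 1
    return vec


def to_sparse(vec):
    """Sparse form: keep only the non-zero entries as {index: value}."""
    return {i: v for i, v in enumerate(vec) if v != 0}


matrix = [bag_of_words(doc, DICTIONARY) for doc in DOCS]
print(matrix)                    # [[1, 2, 1, 1, 1, 0, 0, 0, 1, 1], [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]]
print(to_sparse(matrix[0]))      # {0: 1, 1: 2, 2: 1, 3: 1, 4: 1, 8: 1, 9: 1}
print(hashed_features(DOCS[0]))  # 8 buckets whose counts sum to the 8 tokens
```

The hashed vector has a fixed width regardless of vocabulary size, which is the point of feature hashing; the price is that colliding tokens can no longer be told apart.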

Spark Demo - Sentiment Analysis
• Annotated dataset of business news in Croatian language
• Source: icapital.hr
• Small dataset (500)
• We do not expect spectacular results
• Three classes
  • Positive
  • Negative
  • Neutral?

Natural Language Processing / Text Mining
• Preprocessing
  • Stemming
  • Lemmatization
• Features
  • Bag of Words, n-grams
  • TF(t) (Term Frequency) = Occurrences of term t in document / Total number of terms in document
  • IDF(t) (Inverse Document Frequency) = log(Total number of documents / Number of documents containing t)
  • Linguistic variables...
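The TF and IDF definitions above translate almost directly into code. The sketch below uses three tiny hypothetical documents, assumes the natural logarithm (the slide does not state a base), and multiplies TF by IDF to get the usual TF-IDF weight. Returning 0 for a term that appears in no document is an added assumption to avoid division by zero.

```python
import math

# Hypothetical, already tokenized documents
DOCS = [
    ["stocks", "rise", "stocks"],
    ["stocks", "fall"],
    ["market", "calm"],
]


def tf(term, doc):
    """Occurrences of the term in the document / total terms in the document."""
    return doc.count(term) / len(doc)


def idf(term, docs):
    """log(total documents / documents containing the term), natural log assumed."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms get zero weight
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """Common combination of the two: TF multiplied by IDF."""
    return tf(term, doc) * idf(term, docs)


for term in ("stocks", "rise"):
    print(term, round(tf(term, DOCS[0]), 3), round(idf(term, DOCS), 3),
          round(tf_idf(term, DOCS[0], DOCS), 3))
```

In the first document "stocks" occurs twice and "rise" once, yet "rise" gets the higher TF-IDF weight because "stocks" also appears in another document and is therefore less distinctive.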

NLP in Croatia
• FFZG
  • Free components
  • http://nlp.ffzg.hr
• FER
  • Text Mining Add-On for Orange
  • https://bitbucket.org/biolab/orange-text/src
• FOI - www.foi.hr
• Someone else?

Typical ML/NLP workflow (Orange)
Most of this we can do in Spark, and soon all of it (ML "Pipelines")...

Where to learn ML?
• Croatian universities
  • FER, FOI, PMF, Algebra, FFZG for NLP, etc.
• By yourself - Internet
  • Papers, books, blogs
  • MOOCs (Coursera, edX, etc.)
    • Famous https://www.coursera.org/course/ml
  • Prerequisites (besides programming):
    • https://www.khanacademy.org/math/differential-calculus
    • https://www.khanacademy.org/math/linear-algebra
    • https://www.khanacademy.org/math/probability
    • https://www.coursera.org/course/matrix
    • https://www.coursera.org/learn/calculus1
• Great resource for Spark: http://ampcamp.berkeley.edu/

Next lectures?
• Entropy and variable importance?
• Methods
  • Linear regression and optimization (Gradient descent)
  • Logistic regression
  • Decision trees (Random Forests)
  • Unsupervised learning
  • Collaborative filtering
  • Neural networks (not in Spark, for now)
  • ...
• Model testing (sampling, measures, ROC curve...)
• ML tips & tricks (regularization, overfitting, etc.)
• ...
