#GHCI16 | Grace Hopper Celebration India 2016
Introduction to Predictive Analytics:
Hands-On Workshop Using R & Python
Presenters:
Python: Lavanya Sita Tekumalla, Sharmistha Jat
R: Maheshwari Dhandapani, Subramanian Lakshminarayanan, Sowmya Venugopal, Bindu
Agenda
- Basics of Predictive Modeling Techniques (30m)
- Hands-On Workshop: Regression
  - (1) Build Model: R (30m)  (2) Build Model: Python (30m)
What is Predictive Analytics?
Learn from available data and make meaningful predictions.

Why Predictive Analytics?
Too much data, too many scenarios: it is hard for humans to explicitly
describe predictive rules for every scenario.

Exercise: let's predict something...
Predict how long it takes to reach home.
Common Analytics Tasks...
Supervised Learning
Regression: predict a continuous target.
Can I predict the time it takes to get home from past history?
Can I predict the Sensex value from past market history?
Common Analytics Tasks...
Supervised Learning
Classification: predict the class/type of an object.
Can I tell images of cats from dogs by learning from examples?
Identify handwritten digits by studying examples.
Common Analytics Tasks...
Unsupervised Learning
Clustering: identify groups inherent in the data.
Given a set of news articles, what are the underlying topics or themes?
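For instance, a minimal clustering sketch in Python (the snippets and the cluster count below are made-up toy examples, not workshop data):

# cluster toy news snippets by word usage
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "stock markets rally as sensex hits record high",
    "bank stocks lead market gains",
    "team wins cricket series after tense final over",
    "captain praises bowlers after series win",
]
X = TfidfVectorizer().fit_transform(articles)     # bag-of-words features
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1] - a finance theme and a cricket theme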
Predict Movie Success?
Predict Movie Success: Features
Features:
- Actors
- Director
- Gross budget
- Social media feedback
- Genre and keywords
- Release date
Example: Predict Movie Sales
Known Data:
Advertising dollars and corresponding sales for many prior movies.
Prediction Task:
For a new movie, given its advertising budget, can you forecast sales?
Regression:
Sales = f(Advertising budget)
How do we learn f?
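A minimal sketch of learning such an f, assuming made-up advertising/sales numbers (not the workshop dataset):

import numpy as np
from sklearn.linear_model import LinearRegression

budget = np.array([[1.0], [2.0], [3.0], [4.0]])   # advertising spend (toy units)
sales = np.array([2.1, 3.9, 6.2, 8.1])            # observed sales (toy units)

f = LinearRegression().fit(budget, sales)         # learn f from past examples
print(f.predict([[5.0]]))                         # forecast sales for a new movie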
Example: Movie Hit/Flop from Budget and Trailer Facebook Likes
Known Data:
Budgets and Facebook statistics for various hit and flop movies.
Prediction Task:
For a new movie, given its budget and the Facebook likes on its trailer,
what is the probability of a hit?
Classification:
Can I learn the separating line between hit and flop movies in the
(Budget, Facebook Likes) plane?
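A minimal sketch of such a classifier, assuming made-up (budget, likes) pairs; logistic regression is one way to learn the separating line:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[10, 200], [12, 250], [60, 9000], [80, 12000]])  # [budget, likes]
y = np.array([0, 0, 1, 1])                                     # 0 = flop, 1 = hit

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[50, 7000]])[0, 1])   # probability of "hit" for a new movie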
The Predictive Analytics Framework
Data/Examples -> Feature Extraction -> Learning Algorithm -> Model
New Data Instance -> Model -> Prediction
Evaluation: how well is my algorithm working?
Model Selection: which learning algorithm to use?
Important Aspects of the Analytics Framework
- Feature Engineering: finding the discerning characteristics
- Data Collection: collecting the right data / combining multiple sources
- Cleanup: a huge effort - noise, missing data, format conversion...

"If you torture the data long enough, it will confess to anything."
-- Ronald Coase
"The goal is to turn data into information and information into insight."
-- Carly Fiorina
Regression Analysis
What?
- "Regression analysis is a way of finding and representing the
relationship between two or more variables."
- A simple yet effective tool for prediction and estimation.
Why?
- To predict an event/outcome using the attributes or features
influencing it.
Examples
- Why don't UPS truck drivers take left turns?
- Predict a movie's rating.
Regression Analysis
How?
The key is to arrive at an equation that captures the relationship
between the outcome and its influencing features.
It answers the questions:
- Which variables matter most, and which least?
  - Independent variables / predictors / features
  - Dependent variable / outcome
- How do those variables interact with each other?

Y = β0 + β1·x1 + β2·x2 + ... + ε

e.g. Movie Rating (Y) modeled from Budget (x1) and Duration (x2).
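A minimal sketch of fitting these coefficients by least squares on synthetic data (the true β values below are made up so we can check that the fit recovers them):

import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.random(100), rng.random(100)        # e.g. scaled budget, duration
Y = 1.0 + 2.0 * x1 - 0.5 * x2 + 0.01 * rng.standard_normal(100)

A = np.column_stack([np.ones(100), x1, x2])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)     # least-squares estimate
print(beta)                                      # approx [1.0, 2.0, -0.5]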
Data Exploration
Identify the nature of the data and the patterns in the underlying set.
Descriptive analysis: describes or summarizes the raw data, making it more
human-interpretable. It condenses data into nuggets of information
(mean, median).
- Missing data: when to impute, when to omit (R packages: mice, VIM, Amelia)
- Nature of the data distribution (spread around the mean, skewness, outliers)
Variables are either continuous (quantitative) or categorical (qualitative).
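A minimal descriptive-analysis sketch in Python/pandas (the column names assume the workshop's movie_metadata.csv is in the working directory):

import pandas as pd

df = pd.read_csv('movie_metadata.csv')
print(df[['budget', 'duration', 'imdb_score']].describe())  # mean, median (50%), spread
print(df['budget'].skew())    # skewness of the distribution
print(df.isna().sum())        # missing values per column: impute or omit?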
Visualize Data Distribution
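The original slide shows plots; a minimal sketch that reproduces this kind of view (assuming the movie_metadata.csv dataset and its imdb_score column):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('movie_metadata.csv')
fig, (ax1, ax2) = plt.subplots(1, 2)
df['imdb_score'].plot.hist(bins=30, ax=ax1, title='Distribution')  # shape, skew
df['imdb_score'].plot.box(ax=ax2, title='Outliers')                # spread, outliers
plt.show()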
Visualizing Relationships Between Variables
- How are two features/variables related to one another? The correlation
coefficient ranges from -1 to +1:
  - -1.00: as one variable increases, the other decreases (perfect negative)
  - +1.00: as one variable increases, the other increases (perfect positive)
  - 0: no linear correlation at all
- Is there redundancy among the features?
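A minimal correlation sketch (column names again assume the movie_metadata.csv dataset):

import pandas as pd

df = pd.read_csv('movie_metadata.csv')
print(df[['budget', 'gross', 'duration', 'imdb_score']].corr())
# entries near +1 or -1 flag strong, possibly redundant, feature pairs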
Data Cleansing
What is cleaning?
"Conversion of raw data -> technically correct data -> consistent data"
Why is cleansing important?
Incorrect or inconsistent data can lead to false conclusions.
- Removal of outliers, which can skew your results
- Handling of missing data
- Removal of duplicates
- Transformation of data
R packages for data cleansing: mice, Amelia, missForest, Hmisc, mi
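The same cleansing steps, sketched in Python/pandas for illustration (the slide itself points to R packages; the column names are assumptions):

import pandas as pd

df = pd.read_csv('movie_metadata.csv')
df = df.drop_duplicates()                          # removal of duplicates
df = df.dropna(subset=['budget', 'imdb_score'])    # handle missing key fields
low, high = df['budget'].quantile([0.01, 0.99])
df = df[df['budget'].between(low, high)]           # trim extreme outliers
df['budget'] = df['budget'].astype(float)          # transformation / type fix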
Data Cleansing
Plotting missing data using the mice package in R
Feature Selection
Identify the important variables for building predictive models, so the
model is free from correlated variables, bias, and unwanted noise.
e.g. the Boruta package in R identifies important variables using Random
Forests.
Building the Model
R - Workshop
R Setup
- Copy the install binaries and packages to your laptop
- Install R & RStudio
- Install the packages (ggplot2, VIM, mice, Hmisc, etc.)
- Copy the model code, the RDS file, and the dataset
- Set the working directory using
  setwd("<dir where you have the script, dataset, and RDS file>")
Explore Data using R
Validate the Model
- Run the model against the "test" data set that was set aside before
training, and predict on it
- Check the predictions against the actual observed values
- (Cross-)validation is done to assess the goodness of fit of the model
- The model should not under- or over-fit future unseen data
- Validate regression using:
  - R² (higher is better)
  - Residuals (ideally randomly distributed; a pattern in the residuals
    indicates heteroscedasticity)
Python - Workshop
Basic Pipeline
1) Data loading and inspection
2) Cleaning and preprocessing
3) Train/test partitioning
4) Feature selection
5) Regression
6) Model selection, parameter tuning, regularization
Data Loading
# load the IMDB data into a Python list of rows
import csv

with open('movie_metadata.csv') as f:
    imdb_data = [row for row in csv.reader(f)]
Columns in Data
'color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title',
'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name',
'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget',
'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio',
'movie_facebook_likes'
Preprocessing of Data
Steps:
1) Convert text fields to numbers
2) Convert strings (numbers in a CSV are read as strings) to float or
int type
3) Remove NaNs
4) Remove uninteresting columns from the data
5) Feature selection

# preprocessing() is the workshop's helper implementing the steps above
import numpy as np

data_float = preprocessing(imdb_data)
data_np = np.array(data_float)
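The helper preprocessing() is not shown on the slide; one possible minimal version is sketched below (the kept columns are an assumption for illustration; the workshop's actual helper evidently kept more columns, since the next slide uses column index 20 as the label):

def preprocessing(rows):
    header, body = rows[0], rows[1:]               # csv.reader keeps the header row
    keep = [i for i, name in enumerate(header)
            if name in ('num_critic_for_reviews', 'duration', 'gross',
                        'num_voted_users', 'budget', 'imdb_score')]
    out = []
    for row in body:
        try:
            out.append([float(row[i]) for i in keep])   # strings -> float
        except ValueError:
            pass                                        # drop rows with missing cells
    return out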
Train and Test Data Partitioning
from sklearn.model_selection import train_test_split

# remove the label (column 20, the prediction target) from the features
data_np_x = np.delete(data_np, [20], axis=1)

# data partitioning: hold out 25% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(
    data_np_x, data_np[:, 20], test_size=0.25, random_state=0)
Regression
# apply regression and voila!
from sklearn.linear_model import Ridge
regr_0 = Ridge(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# model evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('absolute error:', mean_absolute_error(y_test, y_pred))
print('squared error:', mean_squared_error(y_test, y_pred))
Feature Selection
Select important columns that correlate well with the output:
1) Faster model learning and inference
2) Improved accuracy
3) Dimensionality reduction using PCA/SVD, e.g.:

from sklearn.decomposition import TruncatedSVD
from copy import deepcopy

# data_np_onehot: the feature matrix after one-hot encoding categorical columns
svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
data_svd = deepcopy(data_np_onehot)
data_svd = svd.fit_transform(data_svd)
Model Selection
How to select the parameters of a model?
Popular regression models:
1) Linear Regression
2) Ridge Regression: L2 smoothing
3) Kernel Regression: higher-order / non-linear
4) Lasso Regression: L1 smoothing
5) Decision Tree Regression (CART)
6) Random Forest Regression
Ridge Regression: Regularization
Why regularization?
- Less training data: avoid overfitting
- Noisy data: smoothing / robustness to outliers
Ridge Regression: Regularization
# apply Ridge regression
from sklearn.linear_model import Ridge
regr_ridge = Ridge(alpha=10)
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))

# alpha determines how much smoothing/regularization of the weights we want
How to Select the Parameter alpha?
K-fold Cross-Validation: split the training data into K folds; for each
candidate alpha, train on K-1 folds, measure the error on the held-out
fold, rotate through all folds, and pick the alpha with the lowest
average error.
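What GridSearchCV automates on the next slide, spelled out by hand (a sketch reusing x_train/y_train from the partitioning step):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

for alpha in [10, 1, 0.1]:
    errs = []
    for tr, va in KFold(n_splits=3, shuffle=True, random_state=0).split(x_train):
        model = Ridge(alpha=alpha).fit(x_train[tr], y_train[tr])
        errs.append(mean_squared_error(y_train[va], model.predict(x_train[va])))
    print(alpha, np.mean(errs))   # pick the alpha with the lowest mean fold error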
How to Select the Parameter alpha?
K-fold Cross-Validation with GridSearchCV:
verbose_level = 10
from sklearn.model_selection import GridSearchCV
regr_ridge = GridSearchCV(Ridge(), cv=3, verbose=verbose_level,
                          param_grid={"alpha": [10, 1, 0.1]})
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)
print(regr_ridge.best_params_)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))
Lasso Regression: Feature Sparsity
Another form of regularization, with the L1 norm:
# Lasso Regression
from sklearn.linear_model import Lasso
regr_0 = Lasso(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# alpha determines how much sparsity-inducing regularization of the
# weights we want
Lasso Regression: Feature Sparsity
[Plots: coefficients of Ridge Regression vs Lasso Regression - Lasso
drives many coefficients exactly to zero]
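A minimal sketch that reproduces the comparison numerically (reusing x_train/y_train; alpha=1.0 is an arbitrary choice):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(x_train, y_train)
lasso = Lasso(alpha=1.0).fit(x_train, y_train)
print('ridge zero coefficients:', np.sum(ridge.coef_ == 0))   # typically none
print('lasso zero coefficients:', np.sum(lasso.coef_ == 0))   # L1 zeroes many out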
Lasso Regression with Cross-Validation
verbose_level = 1
from sklearn.linear_model import Lasso
regr_ls = GridSearchCV(Lasso(), cv=2, verbose=verbose_level,
                       param_grid={"alpha": [0.01, 0.1, 1, 10]})
regr_ls.fit(x_train, y_train)
y_pred = regr_ls.predict(x_test)
print(regr_ls.best_params_)

# model evaluation
print('Lasso absolute error:', mean_absolute_error(y_test, y_pred))
print('Lasso squared error:', mean_squared_error(y_test, y_pred))
Decision Tree Regression
Decision Tree Regression: Visualization by Depth
[Plots: fitted regression curves at depth 1, depth 2, and depth 5 -
deeper trees capture finer structure but risk overfitting]
Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor

regr_dt = GridSearchCV(DecisionTreeRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5, 6]})
# or without the grid search: regr_dt = DecisionTreeRegressor(max_depth=2)
regr_dt.fit(x_train, y_train)
y_pred = regr_dt.predict(x_test)
print(regr_dt.best_params_)

# model evaluation
print('decision tree absolute error:', mean_absolute_error(y_test, y_pred))
print('decision tree squared error:', mean_squared_error(y_test, y_pred))
Random Forest for Regression
--> Learn multiple decision trees on random subsets of the data
--> Predict the average of the individual trees' predictions
Random Forest Regression
from sklearn.ensemble import RandomForestRegressor

regr_rf = GridSearchCV(RandomForestRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5]})
regr_rf.fit(x_train, y_train)
y_pred = regr_rf.predict(x_test)
print(regr_rf.best_params_)

# model evaluation
print('Random Forest absolute error:', mean_absolute_error(y_test, y_pred))
print('Random Forest squared error:', mean_squared_error(y_test, y_pred))
Other Forms of Regression
# Support Vector Regression
from sklearn.svm import SVR
kfold_regr = GridSearchCV(SVR(), cv=5, verbose=10,
                          param_grid={"C": [10, 1, 0.1, 1e-2],
                                      "epsilon": [0.05, 0.1, 0.2]})

# Gaussian Process Regression
from sklearn.gaussian_process import GaussianProcessRegressor
kfold_regr = GridSearchCV(GaussianProcessRegressor(kernel=None), cv=5,
                          verbose=10,
                          param_grid={"alpha": [10, 1, 0.1, 1e-2]})
Recap of Python Session
Preprocessing:
--> Feature selection
--> Handling missing data
--> Handling categorical data
Model Evaluation: partitioning into training and testing data
Model Selection:
--> Finding parameters: cross-validation
--> Various regression models:
a. Simple model: Linear Regression
b. Regularization (L2 norm): Ridge Regression
c. Sparse regularization: Lasso Regression
d. Interpretable: decision trees
e. Random forests: ensembles of decision trees
Thank you
