Regression - Predictive Analysis using R and Python, 8 December at GHCI16, Bangalore
http://ghcischedule.anitaborg.org/session/predictive-modeling-using-r-and-python/
Predictive Analytics - Workshop
1. Introduction to Predictive Analytics: Hands-On Workshop Using R & Python
Presenters:
Python: Lavanya Sita Tekumalla, Sharmistha Jat
R: Maheshwari Dhandapani, Subramanian Lakshminarayanan, Sowmya Venugopal, Bindu
2. Agenda
- Basics of Predictive Modeling Techniques (30m)
- Hands-on Workshop: Regression
  - (1) Build Model: R (30m)  (2) Build Model: Python (30m)
3. What is Predictive Analytics?
Learn from available data and make meaningful predictions.
Why Predictive Analytics?
Too much data, too many scenarios... It is hard for humans to explicitly describe predictive rules for all scenarios.
Exercise: let's predict something!
Predict how long it takes to reach home.
4. Common Analytics Tasks...
Supervised Learning
Regression: Predict a continuous target
Can I predict the time taken to get home from past history?
Can I predict the Sensex value from past market history?
5. Common Analytics Tasks...
Supervised Learning
Classification: Predict the class/type of an object
Can I classify images of cats vs. dogs from examples?
Can I identify handwritten digits by studying examples?
6. Common Analytics Tasks...
Unsupervised Learning
Clustering: Identify groups inherent in data
Given a set of news articles, what are the underlying topics or themes?
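As a minimal sketch (an addition, not from the deck): one way to group news articles into themes is k-means on TF-IDF features with scikit-learn; the toy articles below are invented for illustration.

# Toy clustering sketch: TF-IDF features + k-means (assumed approach,
# not the deck's code).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = ["markets rally on budget news", "stocks slide amid rate fears",
            "team wins the cricket final", "captain lifts the trophy"]
X = TfidfVectorizer(stop_words='english').fit_transform(articles)
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print(labels)  # one cluster id per article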
7. Predict Movie Success?
8. Predict Movie Success: Features
Features:
- Actors
- Director
- Gross budget
- Social media feedback
- Genre and keywords
- Release date
9. Example: Predict Movie Sales?
Known Data:
Available advertising dollars and the corresponding sales for lots of prior movies.
Prediction Task:
For a new movie, given the advertising budget, can you forecast sales?
Regression:
Sales = f(Advertising budget)
How do we learn f?
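As a minimal sketch (an addition, not from the deck), f can be learned as a straight line with scikit-learn; the budget/sales numbers below are invented for illustration.

# Learn f as a line, Sales ≈ w * Budget + b (toy numbers).
import numpy as np
from sklearn.linear_model import LinearRegression

budget = np.array([[1.0], [2.0], [3.0], [4.0]])  # advertising spend
sales = np.array([2.1, 3.9, 6.2, 8.1])           # observed sales
f = LinearRegression().fit(budget, sales)
print(f.predict([[5.0]]))                        # forecast for a new movie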
10. Example: Movie Hit/Flop from Budget and Trailer Facebook Likes?
Known Data:
Budgets and Facebook statistics of various hit and flop movies...
Prediction Task:
For a new movie, I know the budget and the Facebook likes on the trailer; what is the probability of a hit?
Classification:
Can I learn the separating line between hit and flop movies?
[Scatter plot: Facebook likes vs. budget, with a line separating hit movies from flop movies]
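As a minimal sketch (an addition, not from the deck), such a separating line can be learned with logistic regression; the numbers below are invented for illustration.

# Learn a linear hit/flop boundary from [budget, trailer likes] (toy data).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[10, 500], [80, 200], [60, 900], [15, 100], [70, 950], [20, 300]])
y = np.array([0, 0, 1, 0, 1, 0])                 # 1 = hit, 0 = flop
clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[50, 700]])[0, 1])      # estimated P(hit)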
11. The Predictive Analytics Framework
Data/Examples → Feature Extraction → Learning Algorithm → Model
Model + New Data Instance → Prediction
Evaluation: How well is my algorithm working?
Model Selection: Which learning algorithm to use?
12. Important Aspects of the Analytics Framework:
- Feature Engineering: finding the discerning characteristics
- Data Collection: collecting the right data / combining multiple sources
- Cleanup: huge effort - noise, missing data, format conversion...
"If you torture the data long enough, it will confess to anything." -- Ronald Coase
"The goal is to turn data into information and information into insight." -- Carly Fiorina
13. Regression Analysis
What?
- "Regression analysis is a way of finding and representing the relationship between two or more variables."
- A simple yet effective tool for prediction and estimates
Why?
- To predict an event/outcome using the attributes or features influencing it.
Examples
- Why don't UPS truck drivers take left turns?
- Predict a movie's rating
14. Regression Analysis
How?
The key is to arrive at an equation that captures the relationship between the outcome and its influencing features.
It answers the questions:
- Which variables matter most, and which the least?
  - Independent variables / predictors / features
  - Dependent variable / outcome
- How do those variables interact with each other?
Y = β0 + β1x1 + β2x2 + ... + ε
[Diagram: Movie Rating as the outcome, with Budget and Duration as predictors]
15. Data Exploration
Identify the nature of the data and the patterns in the underlying set.
Descriptive analysis: describes or summarizes the raw data, making it more human-interpretable. It condenses data into nuggets of information (mean, median).
- Missing data: when to impute, when to omit (R packages: mice, VIM, Amelia)
- Nature of the data distribution (spread around the mean, skewness, outliers)
Variables in the data are either continuous (quantitative) or categorical (qualitative).
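As a minimal sketch (an addition, not from the deck), the same descriptive look at the data in Python with pandas, assuming the movie_metadata.csv used later:

# Quick descriptive analysis of the IMDb CSV with pandas.
import pandas as pd

df = pd.read_csv('movie_metadata.csv')
print(df.describe())       # mean, 50% (median), spread per numeric column
print(df.isnull().sum())   # missing values per column
print(df.dtypes)           # continuous vs. categorical columns at a glance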
16. Visualize Data Distribution
17. Visualization of Variable Relationships
- How are two features/variables related to one another? (correlation)
  - -1.00 → as one variable increases, the other decreases
  - +1.00 → as one variable increases, the other increases
  - 0 → no correlation at all
- Is there redundancy?
18. Data Cleansing
What is cleansing?
Conversion of raw data → technically correct data → consistent data
Why is cleansing important?
Incorrect or inconsistent data can lead to drawing false conclusions.
- Removal of outliers, which can skew your results
- Removal (or imputation) of missing data
- Removal of duplicates
- Transformation of data
R packages for data cleansing: mice, Amelia, missForest, Hmisc, mi
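As a minimal sketch (an addition, not from the deck), the same cleansing steps in Python with pandas, assuming the movie_metadata.csv used later:

# Duplicates, missing data, outliers: a toy cleansing pass in pandas.
import pandas as pd

df = pd.read_csv('movie_metadata.csv')
df = df.drop_duplicates()                    # remove duplicate rows
df = df.dropna(subset=['gross', 'budget'])   # omit rows missing key fields
df['duration'] = df['duration'].fillna(df['duration'].median())  # impute
low, high = df['budget'].quantile([0.01, 0.99])
df = df[df['budget'].between(low, high)]     # trim extreme outliers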
19. Data Cleansing
[Plot: missing-data pattern visualized using the mice package in R]
20. Feature Selection
Identify the important variables for building predictive models that are free from "correlated variables", "bias", and "unwanted noise".
e.g. the Boruta package in R → identifies important variables using Random Forest
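As a minimal sketch (an addition, not from the deck), a comparable importance-based selection in Python; x_train, y_train, and the feature_names list are assumed to exist (hypothetical here):

# Rank features by Random Forest importance and keep the top ones.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(x_train, y_train)  # x_train/y_train: assumed numeric training data
ranked = sorted(zip(feature_names, rf.feature_importances_),
                key=lambda p: -p[1])
for name, score in ranked:
    print(name, round(score, 3))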
21. Building the Model
22. R - Workshop
23. R Setup
- Copy the install binaries and packages to your laptop
- Install R & RStudio
- Install the packages (ggplot2, VIM, mice, Hmisc, etc.)
- Copy the model code, the RDS file, and the dataset
- Set the working directory using setwd(<dir where you have the script, dataset, RDS file>)
24. Explore Data using R
25. Validate the Model
- Run the model against the "test" data set, which was set aside before training, and predict
- Check the predictions vs. the actual observed values
- (Cross-)validation is done to assess the "fit"ness of the model
- The model should not under- (or) over-fit future unseen data
- Validate regression using:
  - R² (higher is better)
  - Residuals (ideally randomly distributed, to avoid heteroscedasticity)
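As a minimal sketch (an addition, not from the deck), the same two checks in Python, assuming a fitted model regr and the held-out x_test/y_test from the later slides:

# Validate with R^2 and a quick look at the residuals.
from sklearn.metrics import r2_score

y_pred = regr.predict(x_test)
print('R^2:', r2_score(y_test, y_pred))  # higher is better
residuals = y_test - y_pred              # should look like random noise
print('residual mean:', residuals.mean())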
26. Python - Workshop
27. Basic Pipeline
1) Data loading and inspection
2) Cleaning and preprocessing
3) Train/test partitioning
4) Feature selection
5) Regression
6) Model selection, parameter tuning, regularization
28. Data Loading
# loading imdb data into a python list format
import csv
imdb_data_csv = csv.reader(open('movie_metadata.csv'))
imdb_data = []
for item in imdb_data_csv:
    imdb_data.append(item)
29. Columns in Data
'color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title',
'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name',
'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget',
'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio',
'movie_facebook_likes'
30. Preprocessing of Data
Steps:
1) Convert text fields to numbers
2) Convert strings (numbers in a CSV are read in as strings) to float or int type
3) Remove NaNs
4) Remove uninteresting columns from the data
5) Feature selection

import numpy as np
data_float = preprocessing(imdb_data)  # preprocessing(): the workshop's helper for steps 1-4
data_np = np.array(data_float)
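A hedged sketch (an assumption, not the workshop's actual helper) of what preprocessing() might look like for steps 1-4 above, keeping a few numeric columns:

def preprocessing(rows):
    # rows: the list of CSV rows loaded earlier; rows[0] is the header
    header, body = rows[0], rows[1:]
    keep = [header.index(c) for c in ('duration', 'budget', 'gross', 'imdb_score')]
    cleaned = []
    for row in body:
        try:
            cleaned.append([float(row[i]) for i in keep])  # strings -> float
        except (ValueError, IndexError):
            continue  # drop rows with missing or non-numeric fields
    return cleaned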
31. Train and Test Data Partitioning
from sklearn.model_selection import train_test_split

# remove the label (column 20 of the processed matrix) from the features
data_np_x = np.delete(data_np, [20], axis=1)
# data partitioning: 75% train, 25% test
x_train, x_test, y_train, y_test = train_test_split(
    data_np_x, data_np[:, 20], test_size=0.25, random_state=0)
32. Regression
# apply regression and voila!
from sklearn.linear_model import Ridge
regr_0 = Ridge(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# model evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('absolute error:', mean_absolute_error(y_test, y_pred))
print('squared error:', mean_squared_error(y_test, y_pred))
33. Feature Selection
Select important columns that correlate well with the output:
1) Faster model learning and inference
2) Accuracy improvement
3) Feature selection using PCA:

from sklearn.decomposition import TruncatedSVD
from copy import deepcopy

svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
data_svd = deepcopy(data_np_onehot)  # data_np_onehot: the one-hot-encoded feature matrix
data_svd = svd.fit_transform(data_svd)
34. Model Selection
How to select the parameters of a model
Types of Regression
Popular regression models:
1) Linear Regression
2) Ridge Regression: L2 smoothing
3) Kernel Regression: higher-order / non-linear
4) Lasso Regression: L1 smoothing
5) Decision Tree Regression (CART)
6) Random Forest Regression
35. Ridge Regression: Regularization
Why regularization?
- Less training data: avoid overfitting
- Noisy data: smoothing / robustness to outliers
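For reference (an addition, not on the original slide), ridge regression penalizes large weights by adding an L2 term to the least-squares objective:

    minimize over w:  ‖y − Xw‖² + α‖w‖²

Here α is the regularization strength (the alpha parameter of scikit-learn's Ridge): larger α means more smoothing of the weights.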
36. Ridge Regression: Regularization
# apply Ridge regression!
from sklearn.linear_model import Ridge
regr_ridge = Ridge(alpha=10)
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))
# alpha determines how much smoothing/regularization of the weights we want
37. How to Select Parameter alpha?
K-fold Cross Validation:
38. How to Select Parameter alpha?
How to select Parameter alpha?
K-fold Cross Validation:
from sklearn.model_selection import GridSearchCV

verbose_level = 10
regr_ridge = GridSearchCV(Ridge(), cv=3, verbose=verbose_level,
                          param_grid={"alpha": [10, 1, 0.1]})
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)
print(regr_ridge.best_params_)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))
39. Lasso Regression: Feature Sparsity
Another form of regularization, with the L1 norm:

# Lasso Regression
from sklearn.linear_model import Lasso
regr_0 = Lasso(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)
# alpha determines how much sparsity-inducing regularization of the weights we want
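For reference (an addition, not on the original slide), the lasso swaps the ridge L2 penalty for an L1 penalty, which drives some weights exactly to zero, hence the feature sparsity:

    minimize over w:  ‖y − Xw‖² + α‖w‖₁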
40. Lasso Regression: Feature Sparsity
[Plots: the coefficients learned by Ridge Regression vs. Lasso Regression]
41. Lasso Regression: Regularization
verbose_level = 1
from sklearn.linear_model import Lasso
regr_ls = GridSearchCV(Lasso(), cv=2, verbose=verbose_level,
                       param_grid={"alpha": [0.01, 0.1, 1, 10]})
regr_ls.fit(x_train, y_train)
y_pred = regr_ls.predict(x_test)
print(regr_ls.best_params_)

# model evaluation
print('Lasso absolute error:', mean_absolute_error(y_test, y_pred))
print('Lasso squared error:', mean_squared_error(y_test, y_pred))
42. Decision Tree Regression
43. Decision Tree Regression: Visualization with Depth
[Plots: decision tree regression fits at depth 1, depth 2, and depth 5]
44. Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor

regr_dt = GridSearchCV(DecisionTreeRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5, 6]})
# regr_dt = DecisionTreeRegressor(max_depth=2)
regr_dt.fit(x_train, y_train)
y_pred = regr_dt.predict(x_test)
print(regr_dt.best_params_)

# model evaluation
print('decision tree absolute error:', mean_absolute_error(y_test, y_pred))
print('decision tree squared error:', mean_squared_error(y_test, y_pred))
45. Random Forest for Regression
--> Learn multiple decision trees on random subsets of the data
--> Predict the value as the average of the predictions from the individual trees
46. Random Forest Regression
from sklearn.ensemble import RandomForestRegressor

regr_rf = GridSearchCV(RandomForestRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5]})
regr_rf.fit(x_train, y_train)
y_pred = regr_rf.predict(x_test)
print(regr_rf.best_params_)

# model evaluation
print('Random Forest absolute error:', mean_absolute_error(y_test, y_pred))
print('Random Forest squared error:', mean_squared_error(y_test, y_pred))
47. Other Forms of Regression
# Support Vector Regression
from sklearn.svm import SVR
kfold_regr = GridSearchCV(SVR(), cv=5, verbose=10,
                          param_grid={"C": [10, 1, 0.1, 1e-2],
                                      "epsilon": [0.05, 0.1, 0.2]})

# Gaussian Process Regression
from sklearn.gaussian_process import GaussianProcessRegressor
kfold_regr = GridSearchCV(GaussianProcessRegressor(kernel=None), cv=5,
                          verbose=10, param_grid={"alpha": [10, 1, 0.1, 1e-2]})
48. Recap of Python Session
Preprocessing:
--> Feature selection
--> Handling missing data
--> Handling categorical data
Model Evaluation: making training and testing splits
Model Selection:
--> Finding parameters: cross-validation
--> Various regression models:
a. Simple model: Linear Regression
b. Regularization (L2 norm): Ridge Regression
c. Sparse regularization: Lasso Regression
d. Interpretable: decision trees
e. Random Forests: ensembles of decision trees
49. Thank you