#GHCI16 | Grace Hopper Celebration India 2016
Introduction to Predictive Analytics:
Hands-On Workshop Using R & Python
Presenters:
Python: Lavanya Sita Tekumalla, Sharmistha Jat
R: Maheshwari Dhandapani, Subramanian Lakshminarayanan, Sowmya Venugopal, Bindu
Agenda
- Basics of Predictive Modeling Techniques (30m)
- Hands-On Workshop: Regression
  - (1) Build Model: R (30m)  (2) Build Model: Python (30m)
What is Predictive Analytics?
Learn from available data and make meaningful predictions.

Why Predictive Analytics?
Too much data, too many scenarios: it is hard for humans to explicitly
describe predictive rules for every scenario.

Exercise: let's predict something...
Predict how long it takes to reach home.
Common Analytics Tasks...
Supervised Learning
Regression: predict a continuous target.
Can I predict the time it takes to get home from past history?
Can I predict the Sensex value from past market history?
Common Analytics Tasks...
Supervised Learning
Classification: predict the class/type of an object.
Can I tell images of cats from dogs by learning from examples?
Identify handwritten digits by studying examples.
Common Analytics Tasks...
Unsupervised Learning
Clustering: identify groups inherent in the data.
Given a set of news articles, what are the underlying topics or themes?
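For instance, a minimal clustering sketch in Python (the snippets and the cluster count below are made-up toy examples, not workshop data):

# cluster toy news snippets by word usage
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

articles = [
    "stock markets rally as sensex hits record high",
    "bank stocks lead market gains",
    "team wins cricket series after tense final over",
    "captain praises bowlers after series win",
]
X = TfidfVectorizer().fit_transform(articles)     # bag-of-words features
labels = KMeans(n_clusters=2, random_state=0).fit_predict(X)
print(labels)   # e.g. [0 0 1 1] - a finance theme and a cricket theme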
Predict Movie Success?
Predict Movie Success: Features
Features:
- Actors
- Director
- Gross budget
- Social media feedback
- Genre and keywords
- Release date
Example: Predict Movie Sales
Known Data:
Advertising dollars and corresponding sales for many prior movies.
Prediction Task:
For a new movie, given its advertising budget, can you forecast sales?
Regression:
Sales = f(Advertising budget)
How do we learn f?
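A minimal sketch of learning such an f, assuming made-up advertising/sales numbers (not the workshop dataset):

import numpy as np
from sklearn.linear_model import LinearRegression

budget = np.array([[1.0], [2.0], [3.0], [4.0]])   # advertising spend (toy units)
sales = np.array([2.1, 3.9, 6.2, 8.1])            # observed sales (toy units)

f = LinearRegression().fit(budget, sales)         # learn f from past examples
print(f.predict([[5.0]]))                         # forecast sales for a new movie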
Example: Movie Hit/Flop from Budget and Trailer Facebook Likes
Known Data:
Budgets and Facebook statistics for various hit and flop movies.
Prediction Task:
For a new movie, given its budget and the Facebook likes on its trailer,
what is the probability of a hit?
Classification:
Can I learn the separating line between hit and flop movies in the
(Budget, Facebook Likes) plane?
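A minimal sketch of such a classifier, assuming made-up (budget, likes) pairs; logistic regression is one way to learn the separating line:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[10, 200], [12, 250], [60, 9000], [80, 12000]])  # [budget, likes]
y = np.array([0, 0, 1, 1])                                     # 0 = flop, 1 = hit

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[50, 7000]])[0, 1])   # probability of "hit" for a new movie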
The Predictive Analytics Framework
Data/Examples -> Feature Extraction -> Learning Algorithm -> Model
New Data Instance -> Model -> Prediction
Evaluation: how well is my algorithm working?
Model Selection: which learning algorithm to use?
Important Aspects of the Analytics Framework
- Feature Engineering: finding the discerning characteristics
- Data Collection: collecting the right data / combining multiple sources
- Cleanup: a huge effort - noise, missing data, format conversion...

"If you torture the data long enough, it will confess to anything."
-- Ronald Coase
"The goal is to turn data into information and information into insight."
-- Carly Fiorina
Regression Analysis
What?
- "Regression analysis is a way of finding and representing the
relationship between two or more variables."
- A simple yet effective tool for prediction and estimation.
Why?
- To predict an event/outcome using the attributes or features
influencing it.
Examples
- Why don't UPS truck drivers take left turns?
- Predict a movie's rating.
Regression Analysis
How?
The key is to arrive at an equation that captures the relationship
between the outcome and its influencing features.
It answers the questions:
- Which variables matter most, and which least?
  - Independent variables / predictors / features
  - Dependent variable / outcome
- How do those variables interact with each other?

Y = β0 + β1·x1 + β2·x2 + ... + ε

e.g. Movie Rating (Y) modeled from Budget (x1) and Duration (x2).
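A minimal sketch of fitting these coefficients by least squares on synthetic data (the true β values below are made up so we can check that the fit recovers them):

import numpy as np

rng = np.random.default_rng(0)
x1, x2 = rng.random(100), rng.random(100)        # e.g. scaled budget, duration
Y = 1.0 + 2.0 * x1 - 0.5 * x2 + 0.01 * rng.standard_normal(100)

A = np.column_stack([np.ones(100), x1, x2])      # design matrix with intercept
beta, *_ = np.linalg.lstsq(A, Y, rcond=None)     # least-squares estimate
print(beta)                                      # approx [1.0, 2.0, -0.5]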
Data Exploration
Identify the nature of the data and the patterns in the underlying set.
Descriptive analysis: describes or summarizes the raw data, making it more
human-interpretable. It condenses data into nuggets of information
(mean, median).
- Missing data: when to impute, when to omit (R packages: mice, VIM, Amelia)
- Nature of the data distribution (spread around the mean, skewness, outliers)
Variables are either continuous (quantitative) or categorical (qualitative).
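A minimal descriptive-analysis sketch in Python/pandas (the column names assume the workshop's movie_metadata.csv is in the working directory):

import pandas as pd

df = pd.read_csv('movie_metadata.csv')
print(df[['budget', 'duration', 'imdb_score']].describe())  # mean, median (50%), spread
print(df['budget'].skew())    # skewness of the distribution
print(df.isna().sum())        # missing values per column: impute or omit?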
Visualize Data Distribution
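The original slide shows plots; a minimal sketch that reproduces this kind of view (assuming the movie_metadata.csv dataset and its imdb_score column):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('movie_metadata.csv')
fig, (ax1, ax2) = plt.subplots(1, 2)
df['imdb_score'].plot.hist(bins=30, ax=ax1, title='Distribution')  # shape, skew
df['imdb_score'].plot.box(ax=ax2, title='Outliers')                # spread, outliers
plt.show()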
Visualizing Relationships Between Variables
- How are two features/variables related to one another? The correlation
coefficient ranges from -1 to +1:
  - -1.00: as one variable increases, the other decreases (perfect negative)
  - +1.00: as one variable increases, the other increases (perfect positive)
  - 0: no linear correlation at all
- Is there redundancy among the features?
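A minimal correlation sketch (column names again assume the movie_metadata.csv dataset):

import pandas as pd

df = pd.read_csv('movie_metadata.csv')
print(df[['budget', 'gross', 'duration', 'imdb_score']].corr())
# entries near +1 or -1 flag strong, possibly redundant, feature pairs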
Data Cleansing
What is cleaning?
"Conversion of raw data -> technically correct data -> consistent data"
Why is cleansing important?
Incorrect or inconsistent data can lead to false conclusions.
- Removal of outliers, which can skew your results
- Handling of missing data
- Removal of duplicates
- Transformation of data
R packages for data cleansing: mice, Amelia, missForest, Hmisc, mi
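The same cleansing steps, sketched in Python/pandas for illustration (the slide itself points to R packages; the column names are assumptions):

import pandas as pd

df = pd.read_csv('movie_metadata.csv')
df = df.drop_duplicates()                          # removal of duplicates
df = df.dropna(subset=['budget', 'imdb_score'])    # handle missing key fields
low, high = df['budget'].quantile([0.01, 0.99])
df = df[df['budget'].between(low, high)]           # trim extreme outliers
df['budget'] = df['budget'].astype(float)          # transformation / type fix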
Data Cleansing
Plotting missing data using the mice package in R
Feature Selection
Identify the important variables for building predictive models, so the
model is free from correlated variables, bias, and unwanted noise.
e.g. the Boruta package in R identifies important variables using Random
Forests.
Building the Model
R - Workshop
R Setup
- Copy the install binaries and packages to your laptop
- Install R & RStudio
- Install the packages (ggplot2, VIM, mice, Hmisc, etc.)
- Copy the model code, the RDS file, and the dataset
- Set the working directory using
  setwd("<dir where you have the script, dataset, and RDS file>")
Explore Data using R
Validate the Model
- Run the model against the "test" data set that was set aside before
training, and predict on it
- Check the predictions against the actual observed values
- (Cross-)validation is done to assess the goodness of fit of the model
- The model should not under- or over-fit future unseen data
- Validate regression using:
  - R² (higher is better)
  - Residuals (ideally randomly distributed; a pattern in the residuals
    indicates heteroscedasticity)
Python - Workshop
Basic Pipeline
1) Data loading and inspection
2) Cleaning and preprocessing
3) Train/test partitioning
4) Feature selection
5) Regression
6) Model selection, parameter tuning, regularization
Data Loading
# load the IMDB data into a Python list of rows
import csv

with open('movie_metadata.csv') as f:
    imdb_data = [row for row in csv.reader(f)]
Columns in Data
'color', 'director_name', 'num_critic_for_reviews', 'duration',
'director_facebook_likes', 'actor_3_facebook_likes', 'actor_2_name',
'actor_1_facebook_likes', 'gross', 'genres', 'actor_1_name', 'movie_title',
'num_voted_users', 'cast_total_facebook_likes', 'actor_3_name',
'facenumber_in_poster', 'plot_keywords', 'movie_imdb_link',
'num_user_for_reviews', 'language', 'country', 'content_rating', 'budget',
'title_year', 'actor_2_facebook_likes', 'imdb_score', 'aspect_ratio',
'movie_facebook_likes'
Preprocessing of Data
Steps:
1) Convert text fields to numbers
2) Convert strings (numbers in a CSV are read as strings) to float or
int type
3) Remove NaNs
4) Remove uninteresting columns from the data
5) Feature selection

# preprocessing() is the workshop's helper implementing the steps above
import numpy as np

data_float = preprocessing(imdb_data)
data_np = np.array(data_float)
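The helper preprocessing() is not shown on the slide; one possible minimal version is sketched below (the kept columns are an assumption for illustration; the workshop's actual helper evidently kept more columns, since the next slide uses column index 20 as the label):

def preprocessing(rows):
    header, body = rows[0], rows[1:]               # csv.reader keeps the header row
    keep = [i for i, name in enumerate(header)
            if name in ('num_critic_for_reviews', 'duration', 'gross',
                        'num_voted_users', 'budget', 'imdb_score')]
    out = []
    for row in body:
        try:
            out.append([float(row[i]) for i in keep])   # strings -> float
        except ValueError:
            pass                                        # drop rows with missing cells
    return out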
Train and Test Data Partitioning
from sklearn.model_selection import train_test_split

# remove the label (column 20, the prediction target) from the features
data_np_x = np.delete(data_np, [20], axis=1)

# data partitioning: hold out 25% of the rows as a test set
x_train, x_test, y_train, y_test = train_test_split(
    data_np_x, data_np[:, 20], test_size=0.25, random_state=0)
Regression
# apply regression and voila!
from sklearn.linear_model import Ridge
regr_0 = Ridge(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# model evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error
print('absolute error:', mean_absolute_error(y_test, y_pred))
print('squared error:', mean_squared_error(y_test, y_pred))
Feature Selection
Select important columns that correlate well with the output:
1) Faster model learning and inference
2) Improved accuracy
3) Dimensionality reduction using PCA/SVD, e.g.:

from sklearn.decomposition import TruncatedSVD
from copy import deepcopy

# data_np_onehot: the feature matrix after one-hot encoding categorical columns
svd = TruncatedSVD(n_components=5, n_iter=7, random_state=42)
data_svd = deepcopy(data_np_onehot)
data_svd = svd.fit_transform(data_svd)
Model Selection
How to select the parameters of a model?
Popular regression models:
1) Linear Regression
2) Ridge Regression: L2 smoothing
3) Kernel Regression: higher-order / non-linear
4) Lasso Regression: L1 smoothing
5) Decision Tree Regression (CART)
6) Random Forest Regression
Ridge Regression: Regularization
Why regularization?
- Less training data: avoid overfitting
- Noisy data: smoothing / robustness to outliers
Ridge Regression: Regularization
# apply Ridge regression
from sklearn.linear_model import Ridge
regr_ridge = Ridge(alpha=10)
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))

# alpha determines how much smoothing/regularization of the weights we want
How to Select the Parameter alpha?
K-fold Cross-Validation: split the training data into K folds; for each
candidate alpha, train on K-1 folds, measure the error on the held-out
fold, rotate through all folds, and pick the alpha with the lowest
average error.
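What GridSearchCV automates on the next slide, spelled out by hand (a sketch reusing x_train/y_train from the partitioning step):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

for alpha in [10, 1, 0.1]:
    errs = []
    for tr, va in KFold(n_splits=3, shuffle=True, random_state=0).split(x_train):
        model = Ridge(alpha=alpha).fit(x_train[tr], y_train[tr])
        errs.append(mean_squared_error(y_train[va], model.predict(x_train[va])))
    print(alpha, np.mean(errs))   # pick the alpha with the lowest mean fold error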
How to Select the Parameter alpha?
K-fold Cross-Validation with GridSearchCV:
verbose_level = 10
from sklearn.model_selection import GridSearchCV
regr_ridge = GridSearchCV(Ridge(), cv=3, verbose=verbose_level,
                          param_grid={"alpha": [10, 1, 0.1]})
regr_ridge.fit(x_train, y_train)
y_pred = regr_ridge.predict(x_test)
print(regr_ridge.best_params_)

# model evaluation
print('ridge absolute error:', mean_absolute_error(y_test, y_pred))
print('ridge squared error:', mean_squared_error(y_test, y_pred))
Lasso Regression: Feature Sparsity
Another form of regularization, with the L1 norm:
# Lasso Regression
from sklearn.linear_model import Lasso
regr_0 = Lasso(alpha=1.0)
regr_0.fit(x_train, y_train)
y_pred = regr_0.predict(x_test)

# alpha determines how much sparsity-inducing regularization of the
# weights we want
Lasso Regression: Feature Sparsity
[Plots: coefficients of Ridge Regression vs Lasso Regression - Lasso
drives many coefficients exactly to zero]
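A minimal sketch that reproduces the comparison numerically (reusing x_train/y_train; alpha=1.0 is an arbitrary choice):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0).fit(x_train, y_train)
lasso = Lasso(alpha=1.0).fit(x_train, y_train)
print('ridge zero coefficients:', np.sum(ridge.coef_ == 0))   # typically none
print('lasso zero coefficients:', np.sum(lasso.coef_ == 0))   # L1 zeroes many out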
Lasso Regression with Cross-Validation
verbose_level = 1
from sklearn.linear_model import Lasso
regr_ls = GridSearchCV(Lasso(), cv=2, verbose=verbose_level,
                       param_grid={"alpha": [0.01, 0.1, 1, 10]})
regr_ls.fit(x_train, y_train)
y_pred = regr_ls.predict(x_test)
print(regr_ls.best_params_)

# model evaluation
print('Lasso absolute error:', mean_absolute_error(y_test, y_pred))
print('Lasso squared error:', mean_squared_error(y_test, y_pred))
Decision Tree Regression
Decision Tree Regression: Visualization by Depth
[Plots: fitted regression curves at depth 1, depth 2, and depth 5 -
deeper trees capture finer structure but risk overfitting]
Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor

regr_dt = GridSearchCV(DecisionTreeRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5, 6]})
# or without the grid search: regr_dt = DecisionTreeRegressor(max_depth=2)
regr_dt.fit(x_train, y_train)
y_pred = regr_dt.predict(x_test)
print(regr_dt.best_params_)

# model evaluation
print('decision tree absolute error:', mean_absolute_error(y_test, y_pred))
print('decision tree squared error:', mean_squared_error(y_test, y_pred))
Random Forest for Regression
--> Learn multiple decision trees on random subsets of the data
--> Predict the average of the individual trees' predictions
Random Forest Regression
from sklearn.ensemble import RandomForestRegressor

regr_rf = GridSearchCV(RandomForestRegressor(), cv=2, verbose=verbose_level,
                       param_grid={"max_depth": [2, 3, 4, 5]})
regr_rf.fit(x_train, y_train)
y_pred = regr_rf.predict(x_test)
print(regr_rf.best_params_)

# model evaluation
print('Random Forest absolute error:', mean_absolute_error(y_test, y_pred))
print('Random Forest squared error:', mean_squared_error(y_test, y_pred))
Other Forms of Regression
# Support Vector Regression
from sklearn.svm import SVR
kfold_regr = GridSearchCV(SVR(), cv=5, verbose=10,
                          param_grid={"C": [10, 1, 0.1, 1e-2],
                                      "epsilon": [0.05, 0.1, 0.2]})

# Gaussian Process Regression
from sklearn.gaussian_process import GaussianProcessRegressor
kfold_regr = GridSearchCV(GaussianProcessRegressor(kernel=None), cv=5,
                          verbose=10,
                          param_grid={"alpha": [10, 1, 0.1, 1e-2]})
Recap of Python Session
Preprocessing:
--> Feature selection
--> Handling missing data
--> Handling categorical data
Model Evaluation: partitioning into training and testing data
Model Selection:
--> Finding parameters: cross-validation
--> Various regression models:
a. Simple model: Linear Regression
b. Regularization (L2 norm): Ridge Regression
c. Sparse regularization: Lasso Regression
d. Interpretable: decision trees
e. Random forests: ensembles of decision trees
Thank you
