2015 Analytic Challenge
KARAN SARAO
TEAM
• Karan Sarao
ANALYTIC SOFTWARE USED
• Data Preparation – SAS
• Model Building – R
• Hardware
– Acer Aspire 5750
– 6 GB RAM
SOLUTION OVERVIEW
Data Preparation
• Missing Value Treatment
‒ Nominal – New Category
‒ Numeric/Ordinal – Replace with 0 (Value)
• New Variable Creation
‒ Multiple derived variables
Model Tuning and Stacking
• Training / Blending / Testing split
• Caret function used to tune multiple model parameters
• Stacking and testing to optimize sequence
Final Modeling
• 2-stage modeling process adopted
• Initial set of optimized models created in Stage 1
• Scores incorporated into final blended model in Stage 2
Scoring
• 2-stage scoring process followed
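The missing-value rules on this slide can be sketched in a few lines. This is a Python illustration, not the SAS code used in the original work. The function name `treat_missing`, the `"MISSING"` category label and the sample rows are hypothetical; the column names are borrowed from later slides, and their types are assumed for the example.

```python
from typing import Any, Dict, Iterable, List


def treat_missing(
    rows: List[Dict[str, Any]],
    nominal_cols: Iterable[str],
    numeric_cols: Iterable[str],
    new_category: str = "MISSING",  # assumed label; the slides do not name it
) -> List[Dict[str, Any]]:
    """Return copies of the rows with missing values filled per the slide's rules."""
    nominal_cols = list(nominal_cols)
    numeric_cols = list(numeric_cols)
    treated = []
    for row in rows:
        fixed = dict(row)  # copy so the input rows are left untouched
        for col in nominal_cols:
            # Nominal: a missing value becomes its own category level.
            if fixed.get(col) in (None, ""):
                fixed[col] = new_category
        for col in numeric_cols:
            # Numeric/ordinal: a missing value is replaced with 0.
            if fixed.get(col) is None:
                fixed[col] = 0
        treated.append(fixed)
    return treated


# Hypothetical sample rows.
sample = [
    {"TXN_CHANNEL_CD": "WEB", "HH_INCOME": None},
    {"TXN_CHANNEL_CD": None, "HH_INCOME": 5},
]
cleaned = treat_missing(sample, ["TXN_CHANNEL_CD"], ["HH_INCOME"])
```

Giving missing nominal values their own level lets a model learn from the fact that a value is absent, which is one common reason for preferring it over dropping rows.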
SOLUTION OVERVIEW – Continued (Model Tuning)
Model Tuning Process
Data Splitting
• Modeling Data Set – Random Assignment
‒ 50% of Observations
‒ 30% of Observations
‒ 20% of Observations
Stage 1 Modeling
• Stage 1 Models: Model 1, Model 2, Model 3, Model 4, Model 5
Stage 2 Modeling
• Score all 5 Models on Stage 2 Data, append scores as new variables
• Stage 2 Models: Model 1, Model 2, Model 3, Model 4, Model 5
Evaluation Phase
• Run Stage 1 Models
• Run Stage 2 Models
• Compare performance of all Stage 2 Models
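The random three-way assignment in the model tuning process could look roughly like the sketch below, written in Python rather than the SAS/R used in the original work. The function name `three_way_split` and the seed are hypothetical. Mapping the 50% / 30% / 20% shares to Stage 1, Stage 2 and evaluation, in that order, is an assumption based on the order in which the slide lists them.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def three_way_split(
    rows: Sequence[T],
    fractions: Tuple[float, float, float] = (0.5, 0.3, 0.2),
    seed: int = 2015,  # arbitrary seed, used only to make the split repeatable
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the rows and cut them into three disjoint random subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    cut1 = round(n * fractions[0])
    cut2 = cut1 + round(n * fractions[1])
    # The last subset takes whatever remains, so no row is lost to rounding.
    return shuffled[:cut1], shuffled[cut1:cut2], shuffled[cut2:]


# Assumed mapping: 50% Stage 1, 30% Stage 2, 20% evaluation.
stage1, stage2, evaluation = three_way_split(range(1000))
```

Keeping the three subsets disjoint matters for stacking: the Stage 2 models should be trained on scores for rows the Stage 1 models never saw.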
DATA TRANSFORMATIONS
• Mix of Linear and Non-Linear (Tree-Based) Models
‒ Cover each other's weaknesses
‒ Tree-based models are invariant to order-preserving transformations (no need for Log/Exponent etc.)
• More focus on feature engineering; new variables created as below
‒ SHIP_RATIO=(ORDER_SH_AMT+ORDER_ADDL_SH_AMT)/ORDER_GROSS_AMT (Does shipping cost as a ratio of the initial order have any influence?)
‒ PAYMT_RATIO=(ORDER_SH_AMT+ORDER_ADDL_SH_AMT+ORDER_GROSS_AMT)/PAYMENT_QTY (What is the amount of each payment?)
‒ REV_RATIO=TOTAL_REV_PRIOR_TO_A/TENURE (Revenue per unit of tenure)
‒ REV_PER_ORDER=TOTAL_REV_PRIOR_TO_A/TOTAL_ORDERS_PRIOR_TO_A (Revenue per order)
‒ FIRST_ORDER_RATIO=ORDER_GROSS_AMT/ITEM_QTY
‒ FIRST_PAYMENT_RATIO=ORDER_GROSS_AMT/PAYMENT_QTY
‒ ORDER_FREQ=TENURE/TOTAL_ORDERS_PRIOR_TO_A
‒ ORDER_DUE_RATIO=RECENCY/ORDER_FREQ
‒ ORDER_DUE_RATIO_2=(RECENCY-ORDER_FREQ)/ORDER_FREQ
‒ ORDER_DUE_RATIO_3=(RECENCY-ORDER_FREQ)/RECENCY
‒ All divide-by-zero exceptions set to 0
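The derived ratios and the divide-by-zero rule can be written out as a short Python sketch (the original variables were built in SAS). The formulas are taken directly from the slide. The helper names `safe_ratio` and `derive_features` and the example record are hypothetical. The slide does not say how an ORDER_FREQ that was itself zeroed feeds into the ORDER_DUE ratios; this sketch simply reuses the zeroed value.

```python
from typing import Dict


def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0 when the denominator is 0 (the slide's rule)."""
    return numerator / denominator if denominator != 0 else 0.0


def derive_features(r: Dict[str, float]) -> Dict[str, float]:
    """Compute the ten derived variables listed on the slide for one record."""
    shipping = r["ORDER_SH_AMT"] + r["ORDER_ADDL_SH_AMT"]
    # ORDER_FREQ feeds three later ratios, so compute it once.
    order_freq = safe_ratio(r["TENURE"], r["TOTAL_ORDERS_PRIOR_TO_A"])
    return {
        "SHIP_RATIO": safe_ratio(shipping, r["ORDER_GROSS_AMT"]),
        "PAYMT_RATIO": safe_ratio(shipping + r["ORDER_GROSS_AMT"], r["PAYMENT_QTY"]),
        "REV_RATIO": safe_ratio(r["TOTAL_REV_PRIOR_TO_A"], r["TENURE"]),
        "REV_PER_ORDER": safe_ratio(r["TOTAL_REV_PRIOR_TO_A"], r["TOTAL_ORDERS_PRIOR_TO_A"]),
        "FIRST_ORDER_RATIO": safe_ratio(r["ORDER_GROSS_AMT"], r["ITEM_QTY"]),
        "FIRST_PAYMENT_RATIO": safe_ratio(r["ORDER_GROSS_AMT"], r["PAYMENT_QTY"]),
        "ORDER_FREQ": order_freq,
        "ORDER_DUE_RATIO": safe_ratio(r["RECENCY"], order_freq),
        "ORDER_DUE_RATIO_2": safe_ratio(r["RECENCY"] - order_freq, order_freq),
        "ORDER_DUE_RATIO_3": safe_ratio(r["RECENCY"] - order_freq, r["RECENCY"]),
    }


# Hypothetical first-time customer: no prior orders, so several denominators are 0.
example = {
    "ORDER_SH_AMT": 5.0, "ORDER_ADDL_SH_AMT": 1.0, "ORDER_GROSS_AMT": 60.0,
    "PAYMENT_QTY": 3.0, "ITEM_QTY": 2.0, "TOTAL_REV_PRIOR_TO_A": 0.0,
    "TOTAL_ORDERS_PRIOR_TO_A": 0.0, "TENURE": 12.0, "RECENCY": 4.0,
}
features = derive_features(example)
```

For this record the ratios with a zero denominator (REV_PER_ORDER, ORDER_FREQ, ORDER_DUE_RATIO, ORDER_DUE_RATIO_2) all come out as 0 instead of raising an error.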
Stage 1 Models
Multiple Models trained on 50% of the data
• Random Forests (randomForest)
• AdaBoost (ada)
• Gradient Boosting Machines (gbm)
• eXtreme Gradient Boosting (xgboost)
• Logistic Regression (variables selected by studying glmnet output)
• Regularized Logistic Regression (glmnet)
Several of the above models have tunable parameters
• Caret package in R used to cycle through various combinations of input parameters using multiple folds
• Problem statement specifies rank order primacy, hence ROC metric maximized
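Since rank ordering is what the problem statement rewards, the metric being maximized is the area under the ROC curve (AUC). As a minimal illustration of what that number measures, the Python sketch below computes AUC from ranks (the Mann-Whitney formulation). It is not the caret routine used in the original work; the function name `roc_auc` and the toy labels and scores are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random negative."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Sum the 1-based ranks of the positives, sharing the average rank across ties.
    rank_sum = 0.0
    i = 0
    while i < len(pairs):
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        rank_sum += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])   # positives ranked on top
auc_partial = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # one pair out of order
auc_tied = roc_auc([0, 1], [0.5, 0.5])                      # a tie counts as half
```

AUC depends only on the ordering of the scores, not on their scale, which is why it suits a problem judged on rank order.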
Stage 2 Models
• All 5 Models built in Stage 1 used to score both Stage 2 and evaluation data
• 5 score columns added back to the data set (Stage 2 and evaluation)
• 4 Models created again on Stage 2 dataset
• Stage 1 and Stage 2 models are scored on evaluation dataset
• ROC (AUC) calculated for the models on evaluation dataset
• Best Model identified – xgboost (Stage 2)

Model           Stage 1 (AUC) on Evaluation Set   Stage 2 (AUC) on Evaluation Set
xgboost         0.646                             0.647
logit           0.641                             0.646
gbm             0.636                             0.644
glmnet          0.641                             0.642
ada             0.637                             0.642
random forest   0.617                             NA
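The step of scoring the Stage 2 data with the Stage 1 models and adding the scores back as new columns might be sketched as follows. This is a Python illustration with stand-in scoring functions, not the fitted R models from the original work. The names `append_stage1_scores`, the `SCORE_` column prefix and the two toy scorers are hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Scorer = Callable[[Row], float]


def append_stage1_scores(rows: List[Row], stage1_models: Dict[str, Scorer]) -> List[Row]:
    """Return copies of the rows with one extra SCORE_<name> column per Stage 1 model."""
    stacked = []
    for row in rows:
        extended = dict(row)
        for name, model in stage1_models.items():
            # Score from the original features only, so that no model
            # sees another model's score column.
            extended["SCORE_" + name] = model(row)
        stacked.append(extended)
    return stacked


# Hypothetical stand-ins for fitted Stage 1 models.
stage1_models = {
    "logit": lambda r: min(1.0, 0.10 * r["PAYMENT_QTY"]),
    "gbm": lambda r: 1.0 if r["SHIP_RATIO"] > 0.2 else 0.0,
}
stage2_rows = [{"PAYMENT_QTY": 3.0, "SHIP_RATIO": 0.25}]
stage2_stacked = append_stage1_scores(stage2_rows, stage1_models)
```

The same function would be applied to the evaluation data, so that the Stage 2 models see identical columns at training and scoring time.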
Final Model Building
• Data split 50-50 between Stage 1 modeling and Stage 2 blending
• Xgboost used to blend in Stage 2
• Initial 5 models score the submission dataset, and the scores are merged back to create the dataset for the sixth model
• Blend Model used to generate the final submission score
TOP VARIABLES
Important Variables
TXN_CHANNEL_CD
PAYMENT_QTY
RUSH_ORD_FLAG
SHIP_RATIO
FIRST_ORDER_RATIO
DEMOGRAPHIC_SEGMENT
ORDER_GROSS_AMT
RETAIL/CATALOG_SPENDING_QUINTILE
REV_PER_ORDER
HH_INCOME
PAYMT_RATIO
ETHNICITY
LANGUAGE
• Mix of ready-made and derived variables
• Ranking of top variables can be difficult to quantify across multiple modeling techniques/blends
• Plain logistic regression with these variables can create a Model with comparable performance (~0.64 AUC)
KEYS TO SUCCESS
• Derived Variables
‒ Create as many behavioral/pattern variables as possible
‒ Ratios such as revenue/order, order frequency, shipping cost to total cost etc.
• Cross Validation for controlling overfit
‒ K-fold (maximum possible) validation runs
‒ Tune parameters (control depth and boosting rounds to maximize test ROC)
‒ Use grid search for optimum parameter search or employ Caret package
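The grid search over K-fold validation runs can be outlined in plain Python. This is a sketch of the general technique, not the caret call used in the original work. The function names, the parameter names `max_depth` and `nrounds`, the grid values and the toy `toy_evaluate` function are all hypothetical; in practice the evaluate step would fit a model on the training fold and return its AUC on the held-out fold.

```python
import itertools
import random
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs that together cover range(n)."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    return [
        ([idx for j, fold in enumerate(folds) if j != i for idx in fold], folds[i])
        for i in range(k)
    ]


def grid_search(
    grid: Dict[str, Sequence],
    evaluate: Callable[[Dict, List[int], List[int]], float],
    n: int,
    k: int = 5,
) -> Tuple[Dict, float]:
    """Return the parameter combination with the highest mean held-out score."""
    splits = k_fold_indices(n, k)
    best_params, best_score = None, float("-inf")
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(evaluate(params, train, test) for train, test in splits)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for "fit on train, return AUC on test"; it peaks at depth 4, 200 rounds.
def toy_evaluate(params: Dict, train: List[int], test: List[int]) -> float:
    return 0.65 - 0.01 * abs(params["max_depth"] - 4) - 0.0001 * abs(params["nrounds"] - 200)


best_params, best_score = grid_search(
    {"max_depth": [2, 4, 6], "nrounds": [100, 200]}, toy_evaluate, n=100, k=5
)
```

Choosing parameters by the mean held-out score, not the training score, is what keeps deeper trees and extra boosting rounds from being selected merely because they fit the training fold more closely.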

DMA Analytics Challenge 2015 (Winner - First Position)
