The winning entry came from Karan Sarao of India. He used a two-stage modeling process: five initial models were scored on a validation set, and those scores were incorporated into a final blended model. Key aspects of the solution included feature engineering (creating new derived variables) and extensive model tuning with cross-validation. The top-performing model was an extreme gradient boosting (xgboost) model.
3. ANALYTIC SOFTWARE USED
• Data Preparation – SAS
• Model Building – R
• Hardware
  – Acer Aspire 5750
  – 6 GB RAM
4. SOLUTION OVERVIEW
Data Preparation
• Missing value treatment
  – Nominal – new category
  – Numeric/Ordinal – replace with 0
• New variable creation
  – Multiple derived variables
Model Tuning and Stacking
• Training / blending / testing split
• Caret functions to tune multiple model parameters
• Stacking and testing to optimize the sequence
Final Modeling
• Two-stage modeling process adopted
• Initial set of optimized models created in Stage 1
• Scores incorporated into a final blended model in Stage 2
Scoring
• Two-stage scoring process followed
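A minimal base-R sketch of the missing-value rules above. The actual data preparation was done in SAS; the data frame name 'train' is a placeholder for illustration.

```r
# Assumption: a data frame named 'train' holds the raw predictors.
for (col in names(train)) {
  if (is.numeric(train[[col]])) {
    # Numeric/ordinal columns: replace missing values with 0, per the slide
    train[[col]][is.na(train[[col]])] <- 0
  } else {
    # Nominal columns: missing values become an explicit new category
    x <- as.character(train[[col]])
    x[is.na(x)] <- "MISSING"
    train[[col]] <- factor(x)
  }
}
```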
5. Model Tuning Process (Solution Overview – Continued)
Phases: Data Splitting → Stage 1 Modeling → Stage 2 Modeling → Evaluation
• Data Splitting – modeling data set randomly assigned: 50% of observations to Stage 1, 30% to Stage 2, 20% to evaluation
• Stage 1 Modeling – Models 1–5 trained on the Stage 1 split
• Stage 2 Modeling – all 5 Stage 1 models scored on the Stage 2 data, scores appended as new variables; Models 1–5 trained again on the augmented data
• Evaluation – run the Stage 1 and Stage 2 models and compare the performance of all Stage 2 models
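A hypothetical sketch of the 50/30/20 random assignment described above; 'model_data' and the seed value are assumptions, not values given in the deck.

```r
set.seed(42)                                   # arbitrary seed for reproducibility
grp <- sample(c("stage1", "stage2", "eval"),
              size = nrow(model_data), replace = TRUE,
              prob = c(0.50, 0.30, 0.20))      # 50/30/20 split as in the flowchart
stage1_data <- model_data[grp == "stage1", ]   # train the five Stage 1 models here
stage2_data <- model_data[grp == "stage2", ]   # blending/stacking data
eval_data   <- model_data[grp == "eval", ]     # held out for model comparison
```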
6. DATA TRANSFORMATIONS
• Mix of linear and non-linear (tree-based) models
  – Cover each other's weaknesses
  – Tree-based models are invariant to order-preserving transformations (no need for log/exponent transforms, etc.)
• More focus on feature engineering; new variables created as below:
  – SHIP_RATIO = (ORDER_SH_AMT + ORDER_ADDL_SH_AMT) / ORDER_GROSS_AMT (does shipping cost as a ratio of the initial order have any influence?)
  – PAYMT_RATIO = (ORDER_SH_AMT + ORDER_ADDL_SH_AMT + ORDER_GROSS_AMT) / PAYMENT_QTY (amount of each payment)
  – REV_RATIO = TOTAL_REV_PRIOR_TO_A / TENURE (revenue per unit tenure)
  – REV_PER_ORDER = TOTAL_REV_PRIOR_TO_A / TOTAL_ORDERS_PRIOR_TO_A (revenue per order)
  – FIRST_ORDER_RATIO = ORDER_GROSS_AMT / ITEM_QTY
  – FIRST_PAYMENT_RATIO = ORDER_GROSS_AMT / PAYMENT_QTY
  – ORDER_FREQ = TENURE / TOTAL_ORDERS_PRIOR_TO_A
  – ORDER_DUE_RATIO = RECENCY / ORDER_FREQ
  – ORDER_DUE_RATIO_2 = (RECENCY - ORDER_FREQ) / ORDER_FREQ
  – ORDER_DUE_RATIO_3 = (RECENCY - ORDER_FREQ) / RECENCY
  – All divide-by-zero exceptions set to 0
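A short sketch of how a few of these ratios could be computed in R, with divide-by-zero exceptions set to 0 as stated. The column names follow the slide; the 'train' data frame and the helper function are illustrative.

```r
# Illustrative helper: any division by zero yields 0, as described on the slide
safe_div <- function(num, den) ifelse(den == 0, 0, num / den)

train$SHIP_RATIO      <- safe_div(train$ORDER_SH_AMT + train$ORDER_ADDL_SH_AMT,
                                  train$ORDER_GROSS_AMT)
train$REV_PER_ORDER   <- safe_div(train$TOTAL_REV_PRIOR_TO_A, train$TOTAL_ORDERS_PRIOR_TO_A)
train$ORDER_FREQ      <- safe_div(train$TENURE, train$TOTAL_ORDERS_PRIOR_TO_A)
train$ORDER_DUE_RATIO <- safe_div(train$RECENCY, train$ORDER_FREQ)
```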
7. Stage 1 Models
Multiple models trained on 50% of the data:
• Random Forests (randomForest)
• AdaBoost (ada)
• Gradient Boosting Machines (gbm)
• eXtreme Gradient Boosting (xgboost)
• Logistic Regression (variables selected by studying glmnet output)
• Regularized Logistic Regression (glmnet)
Several of the above models have tunable parameters:
• The caret package in R was used to cycle through various combinations of input parameters using multiple folds
• The problem statement specifies rank-order primacy, hence the ROC metric was maximized
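An illustrative caret tuning run for one of the Stage 1 models (xgboost), maximizing ROC under cross-validation. The fold count, grid values, and the target column name TARGET are assumptions; the deck does not list them.

```r
library(caret)

# Assumption: binary target; classProbs = TRUE needs factor levels with valid R names
stage1_data$TARGET <- factor(stage1_data$TARGET, labels = c("no", "yes"))

ctrl <- trainControl(method = "cv", number = 5,          # k-fold cross-validation
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)  # reports ROC, Sens, Spec

grid <- expand.grid(nrounds = c(200, 400), max_depth = c(3, 6), eta = c(0.05, 0.1),
                    gamma = 0, colsample_bytree = 0.8,
                    min_child_weight = 1, subsample = 0.8)

fit_xgb <- train(TARGET ~ ., data = stage1_data, method = "xgbTree",
                 metric = "ROC",                         # rank-order primacy -> maximize ROC
                 trControl = ctrl, tuneGrid = grid)
```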
8. Stage 2 Models
• All 5 models built in Stage 1 used to score both the Stage 2 and evaluation data
• 5 score columns added back to the data set (Stage 2 and evaluation)
• 4 models created again on the Stage 2 dataset
• Stage 1 and Stage 2 models scored on the evaluation dataset
• ROC (AUC) calculated for the models on the evaluation dataset
• Best model identified – xgboost (Stage 2)

Model            Stage 1 AUC (evaluation set)    Stage 2 AUC (evaluation set)
xgboost          0.646                           0.647
logit            0.641                           0.646
gbm              0.636                           0.644
glmnet           0.641                           0.642
ada              0.637                           0.642
random forest    0.617                           NA
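A sketch of the stacking step: the Stage 1 caret models score the Stage 2 and evaluation splits, and the predicted class probabilities are appended as new predictor columns. The model objects other than fit_xgb and the positive class label "yes" are assumptions carried over from the sketch above.

```r
library(pROC)

stage1_models <- list(xgb = fit_xgb, gbm = fit_gbm, glmnet = fit_glmnet, ada = fit_ada)

add_scores <- function(df, models) {
  for (nm in names(models)) {
    # append each model's predicted probability of the positive class as a new column
    df[[paste0("score_", nm)]] <- predict(models[[nm]], newdata = df, type = "prob")[, "yes"]
  }
  df
}

stage2_data <- add_scores(stage2_data, stage1_models)
eval_data   <- add_scores(eval_data,   stage1_models)

# AUC of one model on the evaluation split
auc(roc(eval_data$TARGET, predict(fit_xgb, newdata = eval_data, type = "prob")[, "yes"]))
```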
9. Final Model Building
• Data split 50-50 between Stage 1 modeling and Stage 2 blending
• xgboost used to blend in Stage 2
• Initial 5 models score the submission dataset; scores merged back to create the dataset for the sixth (blend) model
• Blend model used to generate the final submission score
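A final sketch of how the submission could be produced: the Stage 1 models score the submission set, the scores are merged in, and the Stage 2 xgboost blend generates the final probability. The names submission_data, fit_blend, ID, and SCORE are assumptions for illustration.

```r
submission_data <- add_scores(submission_data, stage1_models)   # reuse helper from above
final_score <- predict(fit_blend, newdata = submission_data, type = "prob")[, "yes"]
write.csv(data.frame(ID = submission_data$ID, SCORE = final_score),
          "final_submission.csv", row.names = FALSE)
```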
11. KEYS TO SUCCESS
• Derived variables
  – Create as many behavioral/pattern variables as possible
  – Ratios such as revenue per order, order frequency, shipping cost to total cost, etc.
• Cross-validation for controlling overfit
  – K-fold (maximum possible folds) validation runs
  – Tune parameters (control depth and boosting rounds to maximize test ROC)
  – Use grid search for optimum parameters, or employ the caret package