Photo By: David Doubilet
CIKM AnalytiCup
Lazada Product Title Quality Challenge
1st: $6,000
2nd: $2,000
3rd: $1,000
$2,000
Team Members
Tam T. Nguyen
nthanhtam@gmail.com
Postdoctoral Research Fellow
Ryerson University
Kaggle Grandmaster
Hossein Fani
hosseinfani@gmail.com
PhD Student
University of New Brunswick
Gilberto Titericz
giba1978@gmail.com
Machine Learning Expert
AirBnb Inc.
Kaggle Grandmaster
Ebrahim Bagheri
ebrahim.bagheri@gmail.com
Associate Professor
Ryerson University
Photo By: Justin Hofman
hot sexy red clutch rug sack travel backpack unisex cheap with free gift
1
clarity
2
conciseness
Hot Sexy Tom Clovers Womens Mens Classy Look Cool Simple Style Casual
Canvas Crossbody Messenger Bag Handbag Fashion Bag Tote Handbag Gray
Problem Setting
Clarity: a title is clear if, within five seconds, one can understand the title and what the product is, and
quickly figure out the key attributes (color, size, model, ...).
Conciseness: a title is concise if it is short enough while still containing all the necessary information.
It is not concise if it is too long, with many unnecessary words, or so short that it is unclear what the product is.
Data Set
CIKM AnalytiCup 2017: Bagging Model for Product Title Quality with Noise
ML-DM
1. Cleansing
 Noise
 Missing Values
 Outliers
2. Flirting
 Attributes
 Labels (if any)
 Augmentation
3. Feature Eng.
 Extraction
 Reduction
 Selection
4. Model Eng.
 Selection
 Tuning
 Evaluation
1. Cleansing
 Noise
  HTML tags in short_description (94%)
 Missing Values
  product_type (less than 1%)
  category_lvl_3 (about 6%): assign category_lvl_2
  description (less than 1%)
 Outliers
  price in {-1, 999999, 9999999}
  price normalization based on country
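The cleansing steps above can be sketched in plain Python. This is a minimal illustration, not the team's code: the field names follow the slide (short_description, category_lvl_2, category_lvl_3, price) plus an assumed country field, while the helper names, the regex-based tag stripping, treating the listed prices as missing, and scaling by the per-country median are all assumptions.

```python
import html
import re
from statistics import median

# Prices listed on the slide as outliers; treated here as missing (an assumption).
PRICE_SENTINELS = {-1, 999999, 9999999}
TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Remove HTML tags and unescape entities; a rough regex-based cleaner."""
    no_tags = TAG_RE.sub(" ", text or "")
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def cleanse(rows):
    """Return cleaned copies of product rows (dicts); inputs are not mutated."""
    cleaned = []
    for row in rows:
        row = dict(row)
        row["short_description"] = strip_html(row.get("short_description"))
        # A missing level-3 category falls back to the level-2 category.
        if not row.get("category_lvl_3"):
            row["category_lvl_3"] = row.get("category_lvl_2")
        price = row.get("price")
        row["price"] = None if price is None or price in PRICE_SENTINELS else price
        cleaned.append(row)

    # Normalize prices within each country by that country's median price
    # (hypothetical choice; the slide only says "based on country").
    by_country = {}
    for row in cleaned:
        if row["price"] is not None:
            by_country.setdefault(row["country"], []).append(row["price"])
    medians = {c: (median(p) or 1.0) for c, p in by_country.items()}
    for row in cleaned:
        if row["price"] is not None:
            row["price_norm"] = row["price"] / medians[row["country"]]
        else:
            row["price_norm"] = None
    return cleaned
```

Scaling by a per-country statistic keeps prices quoted in different currencies on a comparable scale; any other robust scaler would serve the same purpose.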
2. Flirting
 Attributes
  Color
  Brand
  Non-English
  <img> Image
  <li> Enumeration
 Labels
  Disagreement in labels! (label noise)
 Augmentation
  Cloning: color, brand
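As a rough sketch of how the attribute signals listed above might be turned into features, the function below derives a few flags from a title and its short_description. The function name, the output keys, the single-token color and brand vocabularies, and the non-ASCII ratio used as a stand-in for "Non-English" are all hypothetical.

```python
import re

IMG_RE = re.compile(r"<img\b", re.IGNORECASE)
LI_RE = re.compile(r"<li\b", re.IGNORECASE)


def description_flags(title, short_description, known_colors, known_brands):
    """Derive simple attribute flags from a title and its raw HTML description.

    known_colors and known_brands are sets of lowercase single-word tokens
    (an assumption; multi-word brands would need phrase matching).
    """
    title_tokens = set(re.findall(r"[a-z0-9]+", title.lower()))
    desc = short_description or ""
    # Share of non-ASCII characters, a crude proxy for non-English titles.
    non_ascii = sum(1 for ch in title if ord(ch) > 127)
    return {
        "has_color": bool(title_tokens & known_colors),
        "has_brand": bool(title_tokens & known_brands),
        "non_english_ratio": non_ascii / len(title) if title else 0.0,
        "has_image": bool(IMG_RE.search(desc)),
        "n_list_items": len(LI_RE.findall(desc)),
    }
```

These flags have to be computed on the raw short_description, before any HTML tags are stripped.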
Label Noise
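multi-class
$f: x_1 \times x_2 \times \cdots \times x_n \to y: \{c_1, c_2, \ldots, c_k\}$
binary (boolean) classifier: $y: \{0, 1\}$
multi-output (label)
$f: x_1 \times x_2 \times \cdots \times x_n \to y_1: \{c_1, c_2, \ldots, c_{k_1}\}$
$y_2: \{c_1, c_2, \ldots, c_{k_2}\}$
$\vdots$
$y_r: \{c_1, c_2, \ldots, c_{k_r}\}$
multi-output binary (boolean) classifier: $y_1: \{0, 1\}$, $y_2: \{0, 1\}$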
Targets correlation: (single, fast model for all targets)
Only 3 combinations for (Clear,Concise):
(1,0), (1,1), (0,0) ⇒ |~Clear & Concise| = 0
if ~Clear then ~Concise
if Concise then Clear
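One way to exploit the three-combination constraint above with a single model is to predict one joint class and derive both targets from it. The sketch below is an illustration under that assumption, not necessarily what the team did; the class ids and function names are hypothetical.

```python
# The three (clear, concise) pairs that occur, mapped to one joint class id.
JOINT_CLASSES = {(0, 0): 0, (1, 0): 1, (1, 1): 2}


def encode(clear, concise):
    """Map a (clear, concise) label pair to a joint class id."""
    if (clear, concise) not in JOINT_CLASSES:
        raise ValueError("concise implies clear; (0, 1) is not expected")
    return JOINT_CLASSES[(clear, concise)]


def marginals(joint_probs):
    """Turn joint class scores [p00, p10, p11] into (P(clear), P(concise))."""
    p00, p10, p11 = joint_probs
    total = p00 + p10 + p11
    return (p10 + p11) / total, p11 / total
```

Because P(concise) is computed from a subset of the mass that makes up P(clear), the derived predictions always satisfy P(concise) <= P(clear), which mirrors "if Concise then Clear".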
3. Feature Eng.
 Extraction
 Reduction
  LSA, t-SNE, PCA, SVD
 Selection
  STD (standard deviation)
  Correlation X~y
   Linear (t-test, chi2)
   Non-linear (mutual information)
  Model-driven
   Linear SVM
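Two of the selection criteria listed above, the standard-deviation filter and mutual information, are simple enough to sketch with the standard library. This is a minimal illustration for discrete features; the function names, the threshold, and the top-k ranking step are assumptions rather than details from the slides.

```python
import math
from collections import Counter
from statistics import pstdev


def drop_low_std(columns, min_std=1e-6):
    """Keep feature columns whose population standard deviation exceeds min_std."""
    return {name: col for name, col in columns.items() if pstdev(col) > min_std}


def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    f_counts, y_counts = Counter(feature), Counter(labels)
    mi = 0.0
    for (f, y), c in joint.items():
        # p(f, y) * log( p(f, y) / (p(f) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log(c * n / (f_counts[f] * y_counts[y]))
    return mi


def top_k_by_mi(columns, labels, k):
    """Rank feature columns by mutual information with the labels; keep k names."""
    scores = {name: mutual_information(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Continuous features would need binning or a nearest-neighbor estimator before mutual information can be computed this way.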
Feature Engineering
Feature Importance
Linear SVM
[Figure: four 10-fold sets (Set 1 to Set 4) passing through three layers: Base Model (fold bagging), Ensemble Model with BLEND and STACK per set (fold bagging), and Final Prediction, a single BLEND across sets (set fold bagging).]
Bagging Models
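The fold-bagging idea can be sketched as follows: for each fold set, a model is trained on all folds but one, predicts the held-out fold (giving out-of-fold predictions that a stacker can train on) and the test set, and the test predictions are averaged over folds and fold sets. The function names, the fit_predict callback, and the use of one random seed per fold set are assumptions made for illustration; the slides do not specify these details.

```python
import random
from statistics import mean


def kfold_indices(n, n_folds, seed):
    """Shuffle range(n) with the given seed and split it into n_folds parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def fold_bagging(X, y, X_test, fit_predict, n_folds=10, seeds=(1, 2, 3, 4)):
    """Return out-of-fold train predictions and bagged test predictions.

    fit_predict(X_train, y_train, X_eval) must return a list with one
    prediction per row of X_eval (hypothetical interface).
    """
    oof_sets, test_preds = [], []
    for seed in seeds:  # one fold set per seed
        oof = [None] * len(X)
        for held_out in kfold_indices(len(X), n_folds, seed):
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
            # Predict the held-out fold (for stacking) and the test set together.
            X_eval = [X[i] for i in held_out] + list(X_test)
            preds = list(fit_predict(X_tr, y_tr, X_eval))
            for i, p in zip(held_out, preds):
                oof[i] = p
            test_preds.append(preds[len(held_out):])
        oof_sets.append(oof)
    # Average across fold sets (train) and across all fold models (test).
    oof_avg = [mean(col) for col in zip(*oof_sets)]
    test_avg = [mean(col) for col in zip(*test_preds)]
    return oof_avg, test_avg
```

Any base learner can be plugged in through fit_predict, and the averaged out-of-fold predictions can serve as input features for a blending or stacking layer.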
Performance Evaluation
SGD: stochastic gradient descent
LOR: logistic regression
RDG: ridge regression
NBC: naive bayes classifier
XGB: extreme gradient boosting
LGB: light gradient boosting
W2V: word2vec
[Figure: Model Importance, with one panel each for clarity and conciseness.]

Editor's Notes

  • #4: On Lazada, we have millions of products across thousands of categories. To stand out from the crowd, sellers employ creative, sometimes disruptive efforts to improve their search relevancy or attract the attention of customers. Product titles like this degenerate user experience by cluttering the site with irrelevant, misleading titles. In this challenge, we provide you with a set of product titles, description, and attributes, together with the associated title quality scores (clarity and conciseness) as labeled by our internal QC team. Your task is to build a product title quality model that can automatically grade the clarity and the conciseness of a product title. judging a book by its cover
  • #15: Contraposition. One target could be used as a feature for the other, but this is a problem in practice because we don't have labels for the validation or test sets.
  • #17: In addition to the given attributes, we extract more features from the textual attributes, title and short_description. Selection methods: stability selection, recursive feature elimination, and cross-validation.