Machine Learning to moderate ads
in a real-world classifieds business
by Vaibhav Singh & Jaroslaw Szymczak
Agenda
• Moderation problem
• Offline model creation
  ◦ feature generation
  ◦ feature selection
  ◦ data leakage
  ◦ the algorithm
• Model evaluation
• Going live with the product
  ◦ is your data really big?
  ◦ automatic model creation pipeline
  ◦ consistent development and production environments
  ◦ platform architecture
  ◦ performance monitoring
50+
countries
60+ million
new monthly listings
18+ million
unique monthly sellers
What do moderators look for?
Avoidance of payment
• Sell another item in a paid listing by changing its content
• Flood the site with duplicate posts to increase visibility
• Create multiple accounts to bypass the free-ads-per-user limit

Violation of ToS
• Add phone numbers or company information on the image rather than in the description or dedicated fields
• Try to sell forbidden items, very often with a title and description crafted to evade keyword filters
Miscategorized listings
• Item is placed in the wrong category
• Item comes from a legitimate business but is marked as coming from an individual
• Job-seeking ads posted among job offers
Offline model creation
Feature engineering... and selection

Feature selection:
• necessary for some algorithms, for others - not so much
• most important features
• avoiding leakage
Feature generation - one-hot-encoding
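A minimal sketch of one-hot encoding with scikit-learn's OneHotEncoder; the ad fields and values below are illustrative, not the actual OLX schema:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# illustrative ad attributes
ads = pd.DataFrame({
    "category": ["electronics", "fashion", "electronics", "cars"],
    "region":   ["berlin", "poznan", "berlin", "lisbon"],
})

encoder = OneHotEncoder(handle_unknown="ignore")  # unseen values map to all-zero columns
X = encoder.fit_transform(ads)                    # sparse matrix, one column per (field, value)
print(X.shape)                                    # (4, 6): 3 category values + 3 region values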
Feature generation - feature hashing
Feature hashing
• Good when dealing with high-dimensional, sparse features (acts as dimensionality reduction)
• Memory efficient
• Con: getting back to feature names is difficult
• Con: hash collisions can have negative effects
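A minimal sketch of the same idea with scikit-learn's FeatureHasher; the "field=value" strings are illustrative:

from sklearn.feature_extraction import FeatureHasher

# each ad is represented as a list of "field=value" strings
hasher = FeatureHasher(n_features=2**20, input_type="string")
X = hasher.transform([
    ["category=electronics", "region=berlin", "word=iphone"],
    ["category=fashion", "region=poznan", "word=dress"],
])
print(X.shape)  # (2, 1048576), but only a handful of non-zero entries per row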
Data Leakage
• Remove obvious fields, e.g. id, account numbers
• Check the feature importances for any unusual observations
• Keep a hold-out set that you do not process with respect to the target variable (see the sketch after this list)
• Closely monitor live performance
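A minimal sketch of the hold-out idea, assuming a hypothetical pandas DataFrame `ads` with a binary `blocked` target: split first, then fit every target-dependent step on the training part only.

from sklearn.model_selection import train_test_split

# `ads` is a hypothetical DataFrame with a binary `blocked` target
train, holdout = train_test_split(ads, test_size=0.2, stratify=ads["blocked"], random_state=42)

# target-dependent statistics are computed on the training part only,
# then merely reused on the hold-out set
block_rate = train.groupby("category")["blocked"].mean()
train = train.assign(category_block_rate=train["category"].map(block_rate))
holdout = holdout.assign(category_block_rate=holdout["category"].map(block_rate))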
The algorithm
Desired features:
• state-of-the-art on structured binary classification problems
• allows reducing variance errors (overfitting)
• allows reducing bias errors (underfitting)
• has an efficient implementation
eXtreme Gradient Boosting (XGBoost)
Source: /JaroslawSzymczak1/xgboost-the-algorithm-that-wins-every-competition
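A minimal training sketch with the XGBoost scikit-learn wrapper; X_train, y_train and X_valid are assumed to come from the feature steps above, and the hyper-parameter values are purely illustrative:

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=500,      # more trees lower the bias
    max_depth=6,           # deeper trees lower bias but raise variance
    learning_rate=0.1,     # shrinkage, trades a bit of bias against variance
    subsample=0.8,         # row subsampling reduces variance
    colsample_bytree=0.8,  # feature subsampling reduces variance
)
model.fit(X_train, y_train)
scores = model.predict_proba(X_valid)[:, 1]  # probability that an ad should be blocked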
Model evaluation
Beyond accuracy
• ROC AUC (area under the Receiver Operating Characteristic curve):
  ◦ can be interpreted as a concordance probability (i.e. a randomly chosen positive example scores higher than a randomly chosen negative one with probability equal to the AUC)
  ◦ too abstract to use as a standalone quality metric
  ◦ does not depend on the class ratio
• PRC AUC (area under the Precision-Recall curve):
  ◦ depends on the data balance
  ◦ is not intuitively interpretable
• Precision @ fixed Recall, Recall @ fixed Precision:
  ◦ can be found by thresholding
  ◦ heavily depend on the data balance
  ◦ best reflect the business requirements and the processing capabilities (then Precision @k is actually more accurate)
  ◦ choose one, and only one, as your KPI and treat the others as constraints (see the sketch after this list)
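A minimal sketch of these metrics with scikit-learn, assuming y_true (0/1 labels) and scores (model probabilities):

import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

roc_auc = roc_auc_score(y_true, scores)           # concordance probability
pr_auc = average_precision_score(y_true, scores)  # PR summary, depends on class balance

# precision @ fixed recall: among thresholds that still reach the target recall,
# report the best precision (and the threshold that achieves it)
precision, recall, thresholds = precision_recall_curve(y_true, scores)
target_recall = 0.80
admissible = recall[:-1] >= target_recall         # the last curve point has no threshold
best = np.argmax(np.where(admissible, precision[:-1], 0.0))
print(f"precision {precision[best]:.3f} at recall {recall[best]:.3f} "
      f"(threshold {thresholds[best]:.3f})")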
Example ROC for moderation problem
Precision-recall curve example
Precision @recall
Recall @precision
Going live with the product
Is your data
really big?
SVM Light
Data Format
• Memory efficient: features can be created on one machine and do not require huge clusters
• Con: the number of features is not stored in the format, so store it separately
1 191:-0.44 87214:-0.44 200004:0.20 200012:1 206976:1 206983:-1 207015:1 207017:1 226201:1
1 1738:0.57 130440:-0.57 206999:0.32 207000:28 207001:6 207013:1 207015:1 207017:1 226300:1
0 2812:-0.63 34755:-0.31 206995:2.28 206997:1 206998:2 206999:0.00 207000:1 207001:28 226192:1
1 4019:0.35 206999:0.43 207000:40 207001:18 207013:1 207014:1 207016:1 226261:1
0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1
1 5878:-0.27 206995:2.28 206998:1 206999:5.80 207000:1 207001:24 226187:1
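A minimal sketch of reading this format back with scikit-learn; the file name and feature count are illustrative, and n_features has to be passed explicitly because, as noted above, the format itself does not carry it:

from sklearn.datasets import load_svmlight_file

# path and feature count are illustrative
X, y = load_svmlight_file("ads.svmlight", n_features=226301)
print(X.shape)  # sparse CSR matrix: rows = ads, columns = encoded/hashed features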
Lessons Learnt
• Do not go for distributed learning if you don't need to
• Choose your tech depending on the data size; do not go for hype-driven development
• Your machine is not the limit; there's the cloud
• Ask yourself: what's the most difficult problem to scale? People
Model Generation Pipeline
Automatic
model creation
pipeline
• Automation makes things deterministic
• Airflow, Luigi and many others are good choices for job dependency management
Luigi Dashboard
Luigi Task Visualizer
Lessons Learnt
• when you manage the output path on your own, create your output at the very end of the task
• you can dynamically create dependencies by yielding tasks
• adding the --workers parameter to your command parallelizes tasks that are ready to run (e.g. python run.py Task --workers 15); see the sketch after this list
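A minimal Luigi sketch of the points above; task names, parameters and paths are illustrative. The dependency is yielded dynamically inside run(), and the output is created only at the very end:

import luigi


class ExtractFeatures(luigi.Task):
    country = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget(f"data/features_{self.country}.svmlight")

    def run(self):
        with self.output().open("w") as f:  # writing via the target keeps the output atomic
            f.write("1 1:0.5 2:1\n")        # placeholder for the real feature extraction


class TrainModel(luigi.Task):
    countries = luigi.ListParameter(default=("pl", "pt"))

    def output(self):
        return luigi.LocalTarget("models/model.bin")

    def run(self):
        # dynamic dependencies: the yielded tasks are completed before run() resumes
        feature_files = yield [ExtractFeatures(country=c) for c in self.countries]
        # ... train on the gathered feature files ...
        with self.output().open("w") as f:  # create the output only at the very end
            f.write("trained model placeholder")


if __name__ == "__main__":
    luigi.run()  # e.g.: python run.py TrainModel --workers 15 --local-scheduler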
Consistent development
and production
environments
Model Serving Architecture
[Architecture diagram] Components: Flask API, queue, Prediction Module, MongoDB; Learning Module built on scikit-learn, XGBoost and Luigi; monitoring & stats via Graphite and Grafana. Flows: ask prediction, return prediction, learning ads.
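A minimal sketch of the Flask prediction endpoint from the diagram; the route, payload fields and the vectorize() helper are hypothetical stand-ins for the shared feature pipeline, not the actual OLX code:

import pickle

from flask import Flask, jsonify, request

from features import vectorize  # hypothetical helper: the same feature pipeline used offline

app = Flask(__name__)
model = pickle.load(open("models/model.bin", "rb"))  # illustrative path to the offline-trained model


@app.route("/predict", methods=["POST"])
def predict():
    ad = request.get_json()                      # e.g. {"id": ..., "title": ..., "description": ...}
    X = vectorize([ad])                          # one-row feature matrix
    score = float(model.predict_proba(X)[0, 1])  # probability that the ad should be blocked
    return jsonify({"ad_id": ad.get("id"), "block_score": score})


if __name__ == "__main__":
    app.run(port=5000)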
Image Model Serving Architecture
[Architecture diagram] Incoming pictures from the OLX site arrive via an AWS Kinesis stream. Components: hash generation, country-specific image moderation, general moderation (NSFW), tag and category prediction; MongoDB and S3 hold data and models; learning cluster on GPUs running TensorFlow, Keras and MXNet.
Performance monitoring
Model monitoring and management
Lessons Learnt
• Always batch: batching reduces CPU utilization, so the same machines can handle many more requests (see the sketch after this list)
• Modularize, Dockerize and orchestrate: containerize your code so that it is independent of machine configuration
• Monitoring: use a monitoring service
• Choose simple and easy tech
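A minimal sketch of the "always batch" advice; queue, model and vectorize are hypothetical stand-ins. Instead of scoring ads one by one, pull them off the queue in chunks and make one vectorization call and one model call per chunk:

def score_batch(batch, model, vectorize):
    X = vectorize(batch)                   # one vectorization call for the whole chunk
    scores = model.predict_proba(X)[:, 1]  # one model call instead of len(batch) calls
    return list(zip((ad["id"] for ad in batch), scores))


def predict_in_batches(queue, model, vectorize, batch_size=256):
    batch = []
    for ad in queue:                       # any iterable of incoming ads
        batch.append(ad)
        if len(batch) == batch_size:
            yield from score_batch(batch, model, vectorize)
            batch = []
    if batch:                              # flush whatever is left
        yield from score_batch(batch, model, vectorize)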
Acknowledgements
 Andrzej Praat
 Wojciech Rybicki
Vaibhav Singh
vaibhav.singh@olx.com
Jaroslaw Szymczak
jaroslaw.szymczak@olx.com
PYDATA BERLIN 2017
July 2nd, 2017