PyData 2017 conference, Berlin: Co-talk with Vaibhav Singh about our daily work @ OLX Tech Hub Berlin
Machine Learning to moderate ads in real world classified's business
1. Machine Learning to moderate
ads in real world classified's
business
by Vaibhav Singh & Jaroslaw Szymczak
2. Agenda
Moderation problem
Offline model creation
feature generation
feature selection
data leakage
the algorithm
Model evaluation
Going live with the product
is your data really big?
automatic model creation pipeline
consistent development and production environments
platform architecture
performance monitoring
4. What do moderators look for?
Avoidance of payment
Sell another item in paid
listing by changing its
content
Flood site with duplicate
posts to increase
visibility
Create multiple accounts
to bypass free ad per
user limit
Violation of ToS
Add Phone numbers,
Company information on
image rather than in
description or dedicated
fields
Try to sell forbidden
items, very often with
title and description that
try to evade keyword
filters
Miscategorized listings
Item is placed in wrong
category
Item is coming from
legitimate business, but
is marked as coming
from individual
Seek problem in job
offers
9. Feature hashing
Good when dealing with high-
dimensional, sparse features --
dimensionality reduction
Memory efficient
Cons - Getting back to feature
names is difficult
Cons - Hash collisions can have
negative effects
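A minimal sketch of the hashing trick described on this slide, using only the standard library. The names `hash_features` and `n_buckets` are illustrative and do not come from the talk. MD5 is used here only because it is stable across runs (Python's built-in `hash()` is salted per process for strings); a real pipeline would more likely use a faster non-cryptographic hash.

```python
import hashlib


def hash_features(tokens, n_buckets=2 ** 20):
    """Map string features to a sparse {index: value} vector of fixed width.

    No vocabulary is stored, which is what makes the approach memory
    efficient and also why feature names cannot easily be recovered.
    """
    vector = {}
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        # A separate hash byte picks the sign, so colliding features
        # tend to cancel out instead of always adding up.
        sign = 1.0 if digest[8] % 2 == 0 else -1.0
        vector[index] = vector.get(index, 0.0) + sign
    return vector


# Hypothetical ad tokens; the "field=value" prefixes are an assumption.
ad_vector = hash_features(["title=iphone", "title=cheap", "category=phones"])
```

A smaller `n_buckets` saves memory but raises the collision rate, which is the trade-off named in the cons above.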
10. Data Leakage
Remove obvious fields
e.g.: id, account numbers
Check the importance of
the features for any
unusual observations
Have hold-out set that you
do not process wrt. target
variable
Closely monitor live
performance
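One way to read the hold-out advice, sketched under assumptions: split the hold-out rows off first, then fit anything that looks at the target (here a per-category target mean, chosen only as an example) on the training rows alone. The function names, the `category` and `is_bad` fields, and the 20% fraction are all hypothetical.

```python
import random
from collections import defaultdict


def split_holdout(rows, holdout_fraction=0.2, seed=42):
    """Shuffle and split BEFORE any target-dependent processing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * holdout_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_target_means(train_rows, key, target):
    """Learn the mean target per category from training rows only."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in train_rows:
        sums[row[key]] += row[target]
        counts[row[key]] += 1
    prior = sum(sums.values()) / max(sum(counts.values()), 1)
    means = {k: sums[k] / counts[k] for k in sums}
    return means, prior


def apply_target_means(rows, key, means, prior):
    """Encode rows with learned means; unseen categories get the prior."""
    return [means.get(row[key], prior) for row in rows]


rows = [{"category": "cars" if i % 2 else "jobs", "is_bad": i % 2} for i in range(10)]
train, holdout = split_holdout(rows)
means, prior = fit_target_means(train, "category", "is_bad")
holdout_encoded = apply_target_means(holdout, "category", means, prior)
```

The hold-out targets are never read while fitting, so a suspiciously good hold-out score points at a leaking feature rather than at the preprocessing.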
15. Beyond accuracy
ROC AUC (area under the Receiver Operating Characteristic curve):
can be interpreted as concordance probability (i.e. AUC is the probability that a random positive
example scores higher than a random negative example)
it is too abstract to use as a standalone quality metric
does not depend on the class ratio
PRC AUC (Precision-Recall Curve)
Depends on data balance
Is not intuitively interpretable
Precision @ fixed Recall, Recall @ fixed Precision:
can be found using thresholding
they heavily depend on data balance
they are the best to reflect the business requirements
and to take into account processing capabilities
(then actually Precision @k is more accurate)
choose one, and only one as your KPI and others as
constraints
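The two readings above can be sketched in plain Python: ROC AUC computed directly as a concordance probability, and precision at a fixed recall found by thresholding. Function names and the toy labels and scores are illustrative, not data from the talk. The pairwise loop is quadratic and meant for understanding, not for large data sets.

```python
def roc_auc_concordance(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_at_recall(labels, scores, min_recall):
    """Best (precision, threshold) among thresholds with recall >= min_recall."""
    total_pos = sum(labels)
    ranked = sorted(zip(scores, labels), reverse=True)
    best = None
    true_pos = 0
    for k, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Only cut between distinct scores; tied items share a threshold.
        if k < len(ranked) and ranked[k][0] == score:
            continue
        if true_pos / total_pos >= min_recall:
            precision = true_pos / k
            if best is None or precision > best[0]:
                best = (precision, score)
    return best


labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc_concordance(labels, scores)
kpi = precision_at_recall(labels, scores, min_recall=1.0)
```

Swapping the roles gives recall at a fixed precision; in either case one metric is the KPI and the other acts as the constraint.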
22. SVM Light
Data Format
Memory Efficient.
Features can be created
on one machine and do
not require huge clusters
Cons - Number of
features is not stored in the file,
store it separately
1 191:-0.44 87214:-0.44 200004:0.20 200012:1 206976:1 206983:-1 207015:1 207017:1 226201:1
1 1738:0.57 130440:-0.57 206999:0.32 207000:28 207001:6 207013:1 207015:1 207017:1 226300:1
0 2812:-0.63 34755:-0.31 206995:2.28 206997:1 206998:2 206999:0.00 207000:1 207001:28 226192:1
1 4019:0.35 206999:0.43 207000:40 207001:18 207013:1 207014:1 207016:1 226261:1
0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1
1 5878:-0.27 206995:2.28 206998:1 206999:5.80 207000:1 207001:24 226187:1
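A small sketch of reading rows like the ones above: each line is a label followed by sparse `index:value` pairs. The function names are illustrative, and the parser ignores optional parts of the format such as `qid:` fields. It also shows why the feature count has to be stored separately: the file only reveals the largest index that happens to occur.

```python
def parse_svmlight_line(line):
    """Return (label, {feature_index: value}) for one SVMLight-style row."""
    body = line.split("#", 1)[0]  # drop an optional trailing comment
    parts = body.split()
    label = float(parts[0])
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


def max_feature_index(lines):
    """Largest index seen; only a lower bound on the true feature count."""
    highest = 0
    for line in lines:
        _, features = parse_svmlight_line(line)
        if features:
            highest = max(highest, max(features))
    return highest


sample = [
    "1 4019:0.35 206999:0.43 207000:40 207001:18 207013:1 207014:1 207016:1 226261:1",
    "0 8903:0.37 207000:4 207001:14 207013:1 207014:1 207016:1 226262:1",
]
label, features = parse_svmlight_line(sample[1])
highest = max_feature_index(sample)
```

Only non-zero entries are written, which is where the memory efficiency comes from.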
23. Lessons Learnt
Do not go for distributed learning if you
don't need to
Choose your tech depending on data size.
Do not go for hype-driven development
Your machine is not the limit, there's the cloud
Ask yourself: What's the most difficult
problem to scale? People
28. Lessons Learnt
when you use the output path on your own,
create your output at the very end of the
task
you can dynamically create dependencies
by yielding the task
adding the workers parameter to your
command parallelizes tasks that are ready
to be run (e.g. python run.py Task
--workers 15)
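The first lesson can be illustrated without any pipeline library. A scheduler such as Luigi (which this deck lists in its learning module) treats a task as complete once its output exists, so a file that appears half-written looks like a finished task. A hedged sketch, assuming you write to the output path yourself: build the result in a temporary file and move it into place as the very last step. The helper name `write_output_atomically` is hypothetical.

```python
import os
import tempfile


def write_output_atomically(output_path, lines):
    """Write lines to a temp file, then move it to output_path at the end.

    If producing the lines fails midway, output_path is never created,
    so the task is not mistaken for a completed one.
    """
    directory = os.path.dirname(os.path.abspath(output_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            for line in lines:
                tmp.write(line + "\n")
        # The rename is the last step; on the same filesystem it is atomic.
        os.replace(tmp_path, output_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

When writing through a Luigi target's own `open("w")`, the library handles the temporary file for you; the sketch matters only when the path is used directly, for example when handing it to an external tool.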
30. Model Serving Architecture
Flask API
Queue Prediction
Module
Mongo
Monitoring & Stats
Graphite, Grafana
Learning
Module
Scikit
XGBoost
Luigi
Ask Prediction
Return Prediction
Learning Ads
31. Image Model Serving Architecture
AWS Kinesis
Stream
Incoming
Pictures
Hash Generation
Country Specific Image
Moderation
General Moderation NSFW
Tag and Category
Prediction
Mongo
OLX Site
S3
Models
GPU Clusters
Learning Cluster
TF, Keras, MxNet
34. Lessons Learnt
Always Batch
Batching will reduce CPU utilization, so the same machines
will be able to handle many more requests
Modularize, Dockerize and Orchestrate
Containerize your code so that it is transparent to Machine
configurations
Monitoring
Use a monitoring service
Choose simple and easy tech
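The "Always Batch" lesson can be sketched as follows, assuming the model exposes some function that scores a whole list of inputs in one call. Here `predict_batch` is a stand-in passed in by the caller and `batch_size=64` is an arbitrary default; neither comes from the talk.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def predict_in_batches(requests, predict_batch, batch_size=64):
    """Call the model once per batch instead of once per request.

    Per-call overhead (feature transformation setup, model dispatch)
    is then paid once per batch rather than once per item.
    """
    predictions = []
    for batch in batched(requests, batch_size):
        predictions.extend(predict_batch(batch))
    return predictions


# Stand-in for a real model: records batch sizes and returns dummy scores.
call_sizes = []


def fake_predict_batch(batch):
    call_sizes.append(len(batch))
    return [len(text) for text in batch]


results = predict_in_batches(["a", "bb", "ccc", "dddd", "eeeee"],
                             fake_predict_batch, batch_size=2)
```

In a live service the batch would typically be filled from a queue with a short timeout, so that latency stays bounded when traffic is low.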
35. Acknowledgements
Andrzej Praat
Wojciech Rybicki
Vaibhav Singh
vaibhav.singh@olx.com
Jaroslaw Szymczak
jaroslaw.szymczak@olx.com
PYDATA BERLIN 2017
July 2nd, 2017