WHAT DOES IT TAKE TO WIN THE
KAGGLE/YANDEX COMPETITION

Christophe Bourguignat
Kenji Lefèvre-Hasegawa
Paul Masurel @Dataiku
Matthieu Scordia @Dataiku
OUTLINE OF THE TALK

• Review of the Kaggle/Yandex Challenge
• How we worked (team work & tools)
• The winning model
GOAL
Re-rank the URLs returned by Yandex according to the personal preferences of each user.
[Diagram: the original result order (url1, url2, url3, url4) is re-ranked into a personalized order.]
ML CHALLENGE
Predict the user's pertinence for each URL and re-rank the result set accordingly.
The Kaggle/Yandex challenge
GIVEN
• 30 days of logs (train: 27 days, test: 3 days)
• Users' historical queries, clicks & dwell times
• Test-session prior activity: queries, clicks & dwell times (a parsing sketch follows below)

SIZE
• 15 GB of data

[Diagram: a user's history is a sequence of queries (Q); a test session contains prior queries (Q), the test query (T), and the ranking to predict (?).]
The Kaggle/Yandex challenge
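The raw logs are large (15 GB), so streaming them session by session pays off. Below is a minimal parsing sketch in Python. The column layout ('M' rows open a session, 'Q'/'T' rows carry the query terms and ten URL,domain result pairs, 'C' rows are clicks) is our reading of the challenge's TSV format and should be checked against the official data description.

    import csv

    def parse_sessions(path):
        """Stream the challenge log one session at a time (assumed TSV layout)."""
        session = None
        with open(path) as f:
            for row in csv.reader(f, delimiter='\t'):
                if len(row) < 3:
                    continue
                if row[1] == 'M':                       # session metadata row
                    if session is not None:
                        yield session
                    session = {'id': row[0], 'day': int(row[2]),
                               'user': row[3], 'queries': [], 'clicks': []}
                elif row[2] in ('Q', 'T'):              # query row (T = test query)
                    session['queries'].append({
                        'time': int(row[1]),
                        'is_test': row[2] == 'T',
                        'serp': row[3],
                        'query_id': row[4],
                        'terms': row[5].split(','),
                        # ten "URLID,DomainID" pairs follow the term list
                        'results': [tuple(p.split(',')) for p in row[6:]]})
                elif row[2] == 'C':                     # click row
                    session['clicks'].append(
                        {'time': int(row[1]), 'serp': row[3], 'url': row[4]})
        if session is not None:
            yield session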
QUALITY METRIC
• One test query per user, taken from the last 3 days
• The NDCG metric penalizes pertinence errors on the top-ranked urls most heavily (see the sketch below)
• No A/B test

[Diagram: Kaggle scores each submitted ranking; a prediction that keeps the pertinent URLs on top is OK, another ranking that buries them is BAD.]
The Kaggle/Yandex challenge
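A minimal NDCG sketch, using the standard (2^rel - 1) gain with a log2(rank + 1) discount; the challenge's exact variant may differ slightly:

    import numpy as np

    def dcg(relevances, k=10):
        # Discounted cumulative gain: gains are damped by log2 of the
        # position, so the top of the ranking dominates the score.
        rel = np.asarray(relevances, dtype=float)[:k]
        return np.sum((2.0 ** rel - 1) / np.log2(np.arange(2, rel.size + 2)))

    def ndcg(ranked_relevances, k=10):
        # Normalize by the DCG of the ideal (descending) ordering.
        ideal = dcg(sorted(ranked_relevances, reverse=True), k)
        return dcg(ranked_relevances, k) / ideal if ideal > 0 else 1.0

    print(ndcg([2, 1, 0, 0]))  # pertinent urls on top -> 1.0
    print(ndcg([0, 1, 0, 2]))  # pertinent url buried  -> ~0.53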
TEAM DATAIKU SCIENCE STUDIO / KAGGLE

• Christophe Bourguignat, Engineer, data enthusiast
• Kenji Lefèvre-Hasegawa, Ph.D. in math, new to ML
• Paul Masurel, Software Engineer @dataiku
• Matthieu Scordia, Data Scientist @dataiku
First meeting: October 16th, 2013

How we worked (Team work & tools)
WE’VE USED
• Related papers (mainly Microsoft's)
• A 12-core, 64 GB server
• Python scikit-learn
• Dataiku Science Studio
• Java RankLib

How we worked (Team work & tools)
DATAIKU SCIENCE STUDIO
[Workflow diagram: the original train set is split into train & validation; features & labels are built by data-driven computation and fed into learning. A validation-split sketch follows below.]
• FEATURES CONSTRUCTION: team members work independently
• LEARNING: team members work independently
How we worked (Team work & tools)
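Because the official test set is the last 3 days of the logs, a like-for-like local validation holds out the final days of the training period. A minimal sketch, reusing the hypothetical parse_sessions() from the parsing sketch above and assuming train days are numbered 1 to 27 (in practice one would stream rather than materialize the full 15 GB):

    # Hold out the last 3 of the 27 training days as a local validation
    # set, mirroring the official train/test split.
    sessions = list(parse_sessions('train.tsv'))   # 'train.tsv' is a placeholder
    train_sessions = [s for s in sessions if s['day'] <= 24]
    valid_sessions = [s for s in sessions if s['day'] > 24]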
HOW MUCH WORK?
• 960+ emails
• 360+ features
• 50+ ideas grid-tuned (300+ models fitted)
• Server heavily loaded during the last 3 weeks
• 56 Kaggle submissions
• 196 teams, 264 players, 3,570 submissions

How we worked (Team work & tools)

[Leaderboard timeline: a climb from Top 25 to Top 10, then 5th, then 1st; the future top 2 & 3 entered the race around 2014-01-01, the team dropped to 3rd for about a week, then finished 1st. Intervals between milestones ranged from a week to half a month.]
PROBLEM ANALYSIS
Query → result set:
• Rank
• URL snippet quality
• URL is skipped, clicked, or missed
Click → reading the URL:
• URL & domain pertinence, measured via dwell time (see the labeling sketch below)
The winning model
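Dwell time is what turns raw clicks into supervised labels. A sketch of the grade mapping, using the 50 / 400 time-unit thresholds from the competition's relevance definition (treat the exact thresholds as an assumption to verify):

    def relevance_grade(dwell_time, is_last_click=False):
        # 2: highly relevant (long dwell, or the last click of the session)
        # 1: relevant        (medium dwell)
        # 0: irrelevant      (no click, or a very short dwell)
        if is_last_click or (dwell_time is not None and dwell_time >= 400):
            return 2
        if dwell_time is not None and dwell_time >= 50:
            return 1
        return 0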
FEATURES
• Rank
• User habits, query specificity (entropy, frequency, …)
• Snippet pertinence
• Missed, Skipped, Clicked
• URL & domain pertinence

Variants of the Missed, Skipped & Clicked signals:
• Probability, stimuli frequency, Mean Reciprocal Rank (MRR) (see the sketch below)
• For each user: history, prior activity in the test session, and their aggregate
• For all users
• Computed both over all queries and over the same query only
The winning model
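One plausible reading of the MRR feature, sketched below: for each URL a user has seen, average the reciprocal rank of the positions at which the user clicked it. The history structure is hypothetical, not the team's actual data layout.

    from collections import defaultdict

    def mrr_per_url(history):
        # history: list of (shown_urls, clicked_url) pairs for past
        # queries, where shown_urls is the ranked result list.
        reciprocal_ranks = defaultdict(list)
        for shown_urls, clicked_url in history:
            for pos, url in enumerate(shown_urls, start=1):
                reciprocal_ranks[url].append(
                    1.0 / pos if url == clicked_url else 0.0)
        return {url: sum(rr) / len(rr)
                for url, rr in reciprocal_ranks.items()}

    history = [(['u1', 'u2', 'u3'], 'u2'), (['u2', 'u1', 'u3'], 'u2')]
    print(mrr_per_url(history))   # u2 clicked at ranks 2 and 1 -> MRR 0.75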
MODELS
• Random Forest (predict_proba) + maximize E(NDCG)
  → reached Kaggle/Yandex Top 1, then fell to 3rd
• LambdaMART (gradient-boosted trees optimized for NDCG)
  → WINS! (a re-ranking sketch follows below)
The winning model
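A minimal sketch of the Random Forest approach in scikit-learn: predict the probability of each relevance grade per URL, then sort by the expected NDCG gain E[2^rel - 1]; sorting by expected gain greedily maximizes the expected DCG. The data below is synthetic, standing in for the team's real features and labels.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Synthetic stand-ins for the real feature matrix and grade labels.
    rng = np.random.RandomState(0)
    X_train = rng.rand(300, 5)                 # 5 toy features
    y_train = rng.randint(0, 3, size=300)      # relevance grades {0, 1, 2}

    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    clf.fit(X_train, y_train)

    # Re-rank one SERP of 10 candidate urls by expected NDCG gain.
    urls = ['url%d' % i for i in range(10)]
    X_serp = rng.rand(10, 5)
    proba = clf.predict_proba(X_serp)          # one row per candidate url
    gains = 2.0 ** clf.classes_ - 1            # gain per grade: [0, 1, 3]
    expected_gain = proba @ gains
    reranked = [u for _, u in sorted(zip(-expected_gain, urls))]

The winning LambdaMART model was trained with Java RankLib, where LambdaMART is ranker type 6 (e.g. java -jar RankLib.jar -train train.letor -ranker 6 -metric2t NDCG@10); that command line is an illustration of the tool, not the team's exact invocation.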
QUESTIONS?
Thank you!
