This document summarizes what it took to win the Kaggle/Yandex competition on re-ranking search results according to users' personal preferences. The goal was to predict how pertinent each URL is to a given user and to re-rank the result set accordingly. The deck then outlines the team's approach: constructing many features (360+), using Dataiku Science Studio for the workflow, and fitting models such as Random Forest and LambdaMART (which won) to improve the NDCG ranking metric. The team, which first met on October 16th, 2013, worked collaboratively to reach the top ranking.
What does it take to win the Kaggle/Yandex competition
1. WHAT DOES IT TAKE TO WIN THE
KAGGLE/YANDEX COMPETITION
Christophe Bourguignat
Kenji Lefèvre-Hasegawa
Paul Masurel @Dataiku
Matthieu Scordia @Dataiku
2. OUTLINE OF THE TALK
• Review of the Kaggle/Yandex Challenge
• How we worked (team work & tools)
• The winning model
3. GOAL Re-rank URLs returned by Yandex according to the personal preferences of the users
[Figure: a result list (url1, url2, url3, url4) re-ranked for the user]
ML CHALLENGE Predict how pertinent each URL is to the user and re-rank the result set accordingly
The Kaggle/Yandex challenge
4. GIVEN
• 30 days of logs (train: 27 days, test: 3 days)
• Users' historic queries, clicks & dwell-times
• Prior activity in the test session: queries, clicks & dwell-times
[Figure: a test session shown as earlier queries (Q) followed by the test query (T), whose ranking is to be predicted (?)]
SIZE
• 15 GB
5. QUALITY METRIC
• One test query per user, taken from the last 3 days
• The NDCG metric penalizes pertinence errors on top-ranked URLs
• No A/B test
[Figure: a predicted ranking and another ranking of url1 to url4, labelled OK and BAD, as scored by Kaggle]
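To make the metric concrete, here is a minimal sketch of NDCG for a single ranked result list. It assumes the common formulation with gain 2^rel - 1 and a log2 position discount; the exact gain, discount, and cutoff used by the competition are not stated on the slide, and the function names `dcg` and `ndcg` are illustrative only.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    # Position 0 is the top result; its discount is log2(2) = 1.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    # A list with no relevant URL has no ideal ordering; score it 0 by convention.
    return dcg(relevances) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # The same relevant URL placed at rank 1, 2, and 3.
    for ranking in ([2, 0, 0], [0, 2, 0], [0, 0, 2]):
        print(ranking, round(ndcg(ranking), 3))
```

Placing the one relevant URL lower in the list lowers the score, which is the "penalizes errors on top-ranked URLs" behaviour the slide refers to.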
6. TEAM DATAIKU SCIENCE STUDIO / KAGGLE
• Christophe Bourguignat Engineer, data enthusiast
• Kenji Lefèvre-Hasegawa Ph.D. math, new to ML
• Paul Masurel Software Engineer @dataiku
• Matthieu Scordia Data Scientist @dataiku
First meeting: October 16th, 2013
How we worked (Team work & tools)
8. DATAIKU SCIENCE STUDIO
[Diagram: workflow in Dataiku Science Studio. The original train set is split into train & validation; features and labels are built from it and fed to learning.]
• FEATURES CONSTRUCTION: team members work independently
• LEARNING: team members work independently
• DATA DRIVEN COMPUTATION
9. HOW MUCH WORK?
• 960+ emails
• 360+ features
• 50+ ideas tuned by grid search (300+ models fitted)
• Server heavily loaded during the last 3 weeks
• 56 Kaggle submissions
• 196 teams, 264 players, 3570 submissions
[Figure: timeline of the team's leaderboard position (Top 25, Top 10, 5th, 3rd, 1st), with intervals of 1 week to 1/2 month between steps, the date 2014-01-01 marked, and the annotation "Future top 2 & 3 enter race"]
10. PROBLEM ANALYSIS
Query
Result Set
• Rank
• URL Snippet Quality
• URL is skipped, clicked or missed
CLICK
Reading URL
• URL & Domain pertinence with dwell-time
The winning model
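The skipped / clicked / missed distinction can be sketched in a few lines. The definitions used here are an assumption, since the slide does not spell them out: a URL is "skipped" if it was not clicked but sits above the lowest clicked position (so the user presumably saw it), and "missed" if it was not clicked and sits below the last click (or nothing was clicked at all). The function name `label_results` is hypothetical.

```python
def label_results(ranked_urls, clicked_urls):
    """Label each displayed URL as 'clicked', 'skipped', or 'missed'.

    Assumed definitions (not confirmed by the slides):
      skipped: not clicked, ranked above the lowest clicked URL
      missed:  not clicked, ranked below the lowest clicked URL
    """
    clicked = set(clicked_urls)
    click_positions = [pos for pos, url in enumerate(ranked_urls) if url in clicked]
    last_click = max(click_positions) if click_positions else -1

    labels = {}
    for pos, url in enumerate(ranked_urls):
        if url in clicked:
            labels[url] = "clicked"
        elif pos < last_click:
            labels[url] = "skipped"
        else:
            labels[url] = "missed"
    return labels


if __name__ == "__main__":
    print(label_results(["url1", "url2", "url3", "url4"], ["url3"]))
```

Counting these labels per user, per URL, or per domain gives the kind of behavioural signal listed under the result set above.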
11. FEATURES
Features:
• Rank
• User habits, query specificity (entropy, frequency, …)
• Snippet pertinence
• Missed, Skipped, Clicked
• URL & Domain pertinence
Variants of Missed, Skipped & Clicked:
• Probability, stimuli frequency, Mean Reciprocal Rank (MRR)
• For each user: historic activity, previous activity in the test session & aggregate
• For all users
• Computed over all queries & over the same query
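Two of the statistics named above, entropy and Mean Reciprocal Rank, are simple to compute. The sketch below is illustrative rather than the team's actual feature code: it assumes "query specificity (entropy)" means the Shannon entropy of the URLs clicked for a query, and that MRR is averaged over the 1-based ranks at which an event (for example a click) occurred. Both function names are hypothetical.

```python
import math
from collections import Counter


def click_entropy(clicked_urls):
    """Shannon entropy (in bits) of the distribution of clicked URLs.

    Low entropy: clicks concentrate on one URL (a specific, navigational query).
    High entropy: clicks spread over many URLs (an ambiguous query).
    """
    counts = Counter(clicked_urls)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over 1-based ranks; 0.0 when there are no events."""
    return sum(1.0 / r for r in ranks) / len(ranks) if ranks else 0.0


if __name__ == "__main__":
    print(click_entropy(["url1", "url1", "url1"]))  # all clicks on one URL
    print(click_entropy(["url1", "url2"]))          # clicks split evenly
    print(mean_reciprocal_rank([1, 2]))
```

The same functions can be applied to different slices of the logs (one user, all users, all queries, the same query) to produce the families of features the slide describes.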
12. MODELS
• Random Forest (predict proba) + maximize E(NDCG): Kaggle/Yandex Top 1, then 3rd
• LambdaMART (gradient boosted trees optimized for NDCG): WINS!
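The "predict proba + maximize E(NDCG)" step can be illustrated as follows. This is a sketch under stated assumptions, not the team's code: it assumes a classifier has already produced, for each URL, a probability for each relevance grade (0, 1, 2), and it sorts URLs by expected gain with gain 2^grade - 1. Sorting by expected gain maximizes expected DCG; for NDCG it is only an approximation, because the normalizer depends on the true labels. The names `expected_gain` and `rerank` are hypothetical.

```python
def expected_gain(grade_probs):
    """Expected DCG gain of one URL given P(grade = 0), P(grade = 1), ..."""
    return sum(p * (2 ** grade - 1) for grade, p in enumerate(grade_probs))


def rerank(url_probs):
    """Order URLs by decreasing expected gain.

    url_probs maps each URL to its list of per-grade probabilities,
    e.g. the output of a classifier's probability prediction.
    """
    return sorted(url_probs, key=lambda url: expected_gain(url_probs[url]),
                  reverse=True)


if __name__ == "__main__":
    # Made-up probabilities for illustration only.
    probs = {
        "url1": [0.8, 0.1, 0.1],
        "url2": [0.2, 0.2, 0.6],
        "url3": [0.5, 0.5, 0.0],
    }
    print(rerank(probs))
```

LambdaMART takes a different route: instead of predicting grade probabilities and ranking afterwards, the boosted trees are trained with gradients weighted by the change in NDCG, so the ranking metric drives the fit directly.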