We describe our approach to solving the Author - Paper Identification Challenge: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge
KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)
1. KDD Cup 2013 Author – Paper Identification Challenge (2nd place team)
Dmitry Efimov
Lucas Silva
Benjamin Solecki
2. Approach summary
Goal: find incorrectly assigned author-paper pairs
Supervised machine learning problem with binary response
Deep feature engineering (> 300 features)
Gradient Boosting Machine (package gbm in R)
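The setup above can be sketched in a few lines. The team used the gbm package in R; the Python below is a simplified stand-in, not their code. It boosts depth-1 trees (stumps) on the gradient of the Bernoulli log-loss, and the two features and all data values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Return (feature, threshold, left_value, right_value) minimizing squared error, or None."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            mean_l = sum(left) / len(left)
            mean_r = sum(right) / len(right)
            sse = (sum((r - mean_l) ** 2 for r in left)
                   + sum((r - mean_r) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, mean_l, mean_r)
    return best[1:] if best else None

def fit_gbm(X, y, n_trees=50, learning_rate=0.3):
    """Boost stumps on the Bernoulli log-loss gradient (y - p). Assumes both classes occur in y."""
    p = sum(y) / len(y)
    f0 = math.log(p / (1.0 - p))          # initial score: log-odds of the base rate
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        stump = fit_stump(X, residuals)
        if stump is None:                 # no feature can split the data
            break
        j, t, left_value, right_value = stump
        stumps.append(stump)
        scores = [s + learning_rate * (left_value if row[j] <= t else right_value)
                  for s, row in zip(scores, X)]
    return f0, learning_rate, stumps

def predict_proba(model, row):
    f0, learning_rate, stumps = model
    score = f0 + learning_rate * sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps)
    return sigmoid(score)

# Hypothetical features per author-paper pair: [papers shared with coauthors, keyword overlap]
X = [[5, 0.9], [4, 0.8], [6, 0.7], [0, 0.1], [1, 0.2], [0, 0.3]]
y = [1, 1, 1, 0, 0, 0]                    # 1 = correctly assigned, 0 = incorrectly assigned
model = fit_gbm(X, y)
```

Leaf values here are plain residual means; production implementations usually refine them further, so treat this only as an illustration of the loss and the boosting loop.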
5. Paper features
Count features: count of keywords
NLP features: tf-idf measure
Multiple source features: paper’s duplicates
Additional features: reverse feature engineering
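As an illustration of the tf-idf measure named on this slide, here is a minimal sketch over paper titles. The slides do not say which text fields or which tf-idf variant were used; the whitespace tokenizer, the unsmoothed idf, and the sample titles are assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))      # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({term: (count / total) * math.log(n_docs / doc_freq[term])
                        for term, count in counts.items()})
    return weights

# Made-up titles for illustration
titles = ["gradient boosting for ranking",
          "boosting decision trees",
          "author name disambiguation"]
title_weights = tf_idf(titles)
```

Terms that occur in fewer titles get a larger weight, so "gradient" outweighs "boosting" in the first title.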
6. Author – paper features (1 of 4)
Count features
Multiple source features
Additional features
Likelihood features
7. Author – paper features (2 of 4)
Count features: count of coauthors with the same affiliation
Additional features: reverse feature engineering (year ranking feature)
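The coauthor count above could be computed roughly as follows. The two lookup tables (`paper_authors`, `affiliation`) and the exact-string match on affiliations are assumptions for this sketch; the slides do not describe the data layout or any affiliation normalization.

```python
def same_affiliation_coauthors(author_id, paper_id, paper_authors, affiliation):
    """Count coauthors on paper_id whose affiliation equals the target author's."""
    own = affiliation.get(author_id)
    if not own:                           # unknown affiliation: nothing to match
        return 0
    return sum(1 for coauthor in paper_authors.get(paper_id, [])
               if coauthor != author_id and affiliation.get(coauthor) == own)

# Hypothetical lookup tables
paper_authors = {"p1": ["a1", "a2", "a3", "a4"]}
affiliation = {"a1": "MIT", "a2": "MIT", "a3": "Stanford", "a4": "MIT"}
n_same = same_affiliation_coauthors("a1", "p1", paper_authors, affiliation)
```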
8. Author – paper features (3 of 4)
Multiple source features: how many times did the author-paper pair appear in the Microsoft database?
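A minimal sketch of this count, assuming the raw data is a list of (author, paper) records in which the same pair may be listed more than once; the record layout and the sample values are hypothetical.

```python
from collections import Counter

def pair_counts(records):
    """Map each (author_id, paper_id) pair to the number of times it is listed."""
    return Counter((author_id, paper_id) for author_id, paper_id in records)

# Made-up records; ("a1", "p1") is listed twice
records = [("a1", "p1"), ("a1", "p1"), ("a2", "p1"), ("a1", "p2")]
counts = pair_counts(records)
```

A `Counter` returns 0 for pairs that never appear, so the feature is defined for every candidate pair.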
9. Author – paper features (4 of 4)
Likelihood features
Lj – likelihood by journal
Lja – likelihood by journal and author
Use Lj and Lja as features:
1) use (α∙Lj + (1−α)∙Lja) as a feature (shrunken likelihood);
2) use mixed-effects models (package lme4 in R) to find α.
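The shrunken likelihood can be illustrated as below. This sketch assumes the likelihoods are the fraction of confirmed pairs in the training data, and it takes α as a given number; the slides estimate α with mixed-effects models (lme4), which is not reproduced here. The row layout and values are made up.

```python
from collections import defaultdict

def likelihoods(rows):
    """rows: (author_id, journal_id, label), label 1 = confirmed pair, 0 = deleted pair."""
    by_journal = defaultdict(lambda: [0, 0])          # journal -> [confirmed, total]
    by_journal_author = defaultdict(lambda: [0, 0])   # (journal, author) -> [confirmed, total]
    for author_id, journal_id, label in rows:
        by_journal[journal_id][0] += label
        by_journal[journal_id][1] += 1
        by_journal_author[(journal_id, author_id)][0] += label
        by_journal_author[(journal_id, author_id)][1] += 1
    l_j = {key: hits / total for key, (hits, total) in by_journal.items()}
    l_ja = {key: hits / total for key, (hits, total) in by_journal_author.items()}
    return l_j, l_ja

def shrunken_likelihood(l_j, l_ja, alpha):
    """alpha * Lj + (1 - alpha) * Lja, with alpha supplied by the caller."""
    return alpha * l_j + (1.0 - alpha) * l_ja

# Made-up training rows
rows = [("a1", "j1", 1), ("a1", "j1", 1), ("a2", "j1", 0), ("a2", "j1", 1)]
l_j, l_ja = likelihoods(rows)
feature = shrunken_likelihood(l_j["j1"], l_ja[("j1", "a2")], alpha=0.5)
```

Shrinking toward Lj keeps the feature stable for authors with only a few papers in a journal, where Lja alone would be noisy.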
11. Result and conclusion
• Our MAP score is 0.98144 (the winning submission scored 0.98259).
• Many algorithms based on MAP optimization (LambdaRank, LambdaMART, RankBoost) gave a lower MAP score than GBM with a Bernoulli distribution.
• The idea of classifying features based on the bipartite author-paper graph is very promising. Analyzing the graph topology can suggest new features.
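For reference, the MAP metric quoted above can be computed as in this sketch. It assumes one ranked list of labels per author (1 = correct paper, 0 = incorrect), ordered by predicted score; the sample rankings are made up.

```python
def average_precision(ranked_labels):
    """Mean of precision@k over the ranks k at which a correct paper appears."""
    hits = 0
    total = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def mean_average_precision(rankings):
    """rankings: {author_id: ranked list of 0/1 labels}. Averages AP over authors."""
    return sum(average_precision(labels) for labels in rankings.values()) / len(rankings)

# Made-up rankings for two authors
rankings = {"a1": [1, 0, 1], "a2": [1, 1, 0]}
map_score = mean_average_precision(rankings)
```

Because only the order within each author's list matters, a classifier's probabilities can be used directly as ranking scores.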