KDD Cup 2013
Author – Paper Identification Challenge (2nd place team)
Dmitry Efimov
Lucas Silva
Benjamin Solecki
Approach summary
• Goal: find incorrectly assigned author-paper pairs
• Supervised machine learning problem with binary response
• Deep feature engineering (> 300 features)
• Gradient Boosting Machine (package gbm in R)
Author – Paper graph
Author features
• Count features: count of journals
• NLP features: tf-idf measure
• Multiple source features: author’s duplicates
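The tf-idf measure named above can be illustrated with a minimal Python sketch. The slides do not say which text fields or which weighting variant the team used (their pipeline was in R), so the toy journal titles, the function name `tf_idf`, and the unsmoothed idf formula log(N / df) are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / number of terms in the document
    idf = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical journal titles attached to one author (not competition data).
titles = [
    "journal of machine learning",
    "journal of applied statistics",
    "machine learning letters",
]
scores = tf_idf(titles)
```

In this toy corpus a term that occurs in a single title ("letters") gets a higher weight than one shared by several titles ("machine"), which is the property that makes tf-idf useful for comparing an author's vocabulary with a paper's.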
Paper features
• Count features: count of keywords
• NLP features: tf-idf measure
• Multiple source features: paper’s duplicates
• Additional features: reverse feature engineering
Author – paper features (1 of 4)
• Count features
• Multiple source features
• Additional features
• Likelihood features
Author – paper features (2 of 4)
• Count features: count of coauthors with the same affiliation
• Additional features: reverse feature engineering (year ranking feature)
Author – paper features (3 of 4)
• Multiple source features: how many times did the author-paper pair appear in the Microsoft database?
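The question on this slide amounts to counting duplicate rows. Below is a minimal Python sketch, assuming the raw data is a list of (author id, paper id) records in which the same pair may be listed more than once. The ids and the names `records`, `pair_counts`, and `pair_count_feature` are made up for illustration and do not come from the competition data.

```python
from collections import Counter

# Hypothetical (author_id, paper_id) records; a repeated pair means the
# pair was listed more than once in the source data.
records = [
    (1, 100),
    (1, 100),
    (1, 101),
    (2, 100),
    (2, 102),
    (2, 102),
    (2, 102),
]

# Count how many times each author-paper pair occurs.
pair_counts = Counter(records)

def pair_count_feature(author_id, paper_id):
    """Number of times the pair appears; 0 if it was never listed."""
    return pair_counts[(author_id, paper_id)]
```

Calling `pair_count_feature(2, 102)` returns 3 for the toy records, and a pair that never occurs returns 0.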
Author – paper features (4 of 4)
Likelihood features
Lj – likelihood by journal
Lja – likelihood by journal and author
• use Lj and Lja as features
1) use (α∙Lj + (1−α)∙Lja) as a feature (shrunken likelihood);
2) use mixed-effects models (package lme4 in R) to find α
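The shrunken likelihood α∙Lj + (1−α)∙Lja can be worked through on toy data. The slides do not define how Lj and Lja are computed, so this Python sketch assumes each is the share of confirmed pairs within its group (by journal, and by journal and author). The rows, the helper `mean_by_key`, and the hand-picked α are hypothetical; the slides estimate α with mixed-effects models, which this sketch does not attempt.

```python
from collections import defaultdict

# Hypothetical training rows: (author_id, journal_id, confirmed), where
# confirmed is 1 for a correct author-paper pair and 0 otherwise.
rows = [
    (1, "J1", 1),
    (1, "J1", 1),
    (1, "J1", 0),
    (2, "J1", 0),
    (2, "J1", 0),
    (2, "J2", 1),
]

def mean_by_key(rows, key):
    """Average the confirmed flag within each group defined by key(row)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for row in rows:
        sums[key(row)] += row[2]
        counts[key(row)] += 1
    return {k: sums[k] / counts[k] for k in sums}

# Lj: likelihood by journal; Lja: likelihood by journal and author
# (both taken here as the share of confirmed pairs, an assumption).
L_j = mean_by_key(rows, key=lambda r: r[1])
L_ja = mean_by_key(rows, key=lambda r: (r[1], r[0]))

def shrunken_likelihood(author_id, journal_id, alpha):
    """alpha * Lj + (1 - alpha) * Lja for one author-journal pair."""
    return alpha * L_j[journal_id] + (1 - alpha) * L_ja[(journal_id, author_id)]

# alpha is fixed by hand here purely for illustration.
value = shrunken_likelihood(1, "J1", alpha=0.5)
```

With α = 1 the feature reduces to Lj and with α = 0 to Lja, so α controls how strongly the sparse per-author estimate is pulled toward the journal-level one.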
Model
• Gradient Boosting Machine (package gbm in R)
• Grid search for the set of parameters
• 83 features in the final model (out of 300 calculated features)
Result and conclusion
• Our MAP score is 0.98144 (the winning submission score is 0.98259).
• Many algorithms based on MAP optimization (LambdaRank, LambdaMART, RankBoost) gave a lower MAP score than GBM with the Bernoulli distribution.
• The idea of classifying features based on the bipartite author-paper graph is very promising. Analyzing the graph topology can give ideas for new features.
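For readers unfamiliar with the MAP scores quoted above, here is a minimal Python sketch of mean average precision. It assumes that average precision is computed per author over that author's ranked papers and then averaged across authors. The rankings are hypothetical 0/1 labels (1 = correctly assigned paper), not competition data.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list of 0/1 relevance labels."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            # Precision at this rank: relevant items seen so far / rank.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(rankings):
    """Mean of the per-author average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Hypothetical rankings, one list per author, best-scored paper first.
rankings = [
    [1, 1, 0],  # AP = (1/1 + 2/2) / 2
    [1, 0, 1],  # AP = (1/1 + 2/3) / 2
    [0, 1],     # AP = (1/2) / 1
]
map_score = mean_average_precision(rankings)
```

Because the metric depends only on the ordering within each author's list, any model that outputs a score per author-paper pair, such as a classifier's predicted probability, can be evaluated with it.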
Thank you!

KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)