KDD Cup 2013 Track 1
Author-Paper Identification Challenge (2nd place team)
Dmitry Efimov, Lucas Silva, Benjamin Solecki

Approach summary
Goal: find incorrectly assigned author-paper pairs
Supervised machine learning problem with binary response
Deep feature engineering (> 300 features)
Gradient Boosting Machine (package gbm in R)

Author-Paper graph

Author features
Count features: count of journals
NLP features: tf-idf measure
Multiple source features: author duplicates

Paper features
Count features: count of keywords
NLP features: tf-idf measure
Multiple source features: paper duplicates
Additional features: reverse feature engineering

Author-paper features (1 of 4)
Count features
Multiple source features
Additional features
Likelihood features

Author-paper features (2 of 4)
Count features: count of coauthors with the same affiliation
Additional features: reverse feature engineering (year ranking feature)

Author-paper features (3 of 4)
Multiple source features: how many times did the author-paper pair appear in the Microsoft database?

Author-paper features (4 of 4)
Likelihood features
Lj: likelihood by journal
Lja: likelihood by journal and author
Use Lj and Lja as features:
1) use (α Lj + (1 − α) Lja) as a feature (shrunken likelihood);
2) mixed-effects models (package lme4 in R) to find α

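The shrunken likelihood on this slide can be sketched in a few lines of Python. This is only an illustration, not the team's implementation. It assumes that a "likelihood" is the share of confirmed pairs within a group of training rows. The field names (`journal`, `author`, `confirmed`) and the toy rows are hypothetical. α is fixed by hand here, whereas the slide estimates it with mixed-effects models in lme4.

```python
from collections import defaultdict


def group_rate(rows, key):
    """Share of confirmed pairs per group; `key` maps a row to its group id."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        group = key(row)
        hits[group] += row["confirmed"]
        totals[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


def shrunken_likelihood(l_j, l_ja, alpha):
    """Blend the two likelihoods as alpha * Lj + (1 - alpha) * Lja."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * l_j + (1.0 - alpha) * l_ja


# Hypothetical training rows: one row per labelled author-paper pair.
rows = [
    {"author": "a1", "journal": "j1", "confirmed": 1},
    {"author": "a1", "journal": "j1", "confirmed": 1},
    {"author": "a2", "journal": "j1", "confirmed": 0},
    {"author": "a2", "journal": "j1", "confirmed": 1},
]

l_j = group_rate(rows, key=lambda r: r["journal"])                  # Lj
l_ja = group_rate(rows, key=lambda r: (r["journal"], r["author"]))  # Lja

# alpha = 0.5 is an arbitrary illustrative value, not a fitted one.
feature = shrunken_likelihood(l_j["j1"], l_ja[("j1", "a2")], alpha=0.5)
```

The blend pulls a sparse journal-and-author estimate toward the better-supported journal-level estimate, which is the usual motivation for shrinkage when some groups contain only a handful of pairs.
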
Model
Gradient Boosting Machine (package gbm in R)
Grid search for the set of parameters
83 features in the final model (out of 300 calculated features)

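A grid search of the kind mentioned on this slide can be sketched as an exhaustive loop over parameter combinations. The slides do not give the grid or the validation scheme, so everything below is an assumption. The keys loosely mirror the gbm arguments `n.trees`, `interaction.depth` and `shrinkage`, the values are made up, and `toy_score` is a stand-in for fitting a model and measuring validation MAP.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical grid; the values are illustrative only.
grid = {
    "n_trees": [500, 1000],
    "interaction_depth": [3, 5],
    "shrinkage": [0.01, 0.05],
}


def toy_score(params):
    """Placeholder objective that peaks at one grid point.

    In practice this would train a model with `params` and return a
    validation metric such as MAP.
    """
    return -(
        abs(params["n_trees"] - 1000) / 1000
        + abs(params["interaction_depth"] - 5)
        + abs(params["shrinkage"] - 0.05)
    )


best_params, best_score = grid_search(grid, toy_score)
```

The cost grows with the product of the list lengths, so a grid this small is cheap, while finer grids over many parameters quickly become expensive.
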
Result and conclusion
Our MAP score is 0.98144 (the winning submission score is 0.98259).
Many algorithms based on MAP optimization (LambdaRank, LambdaMART, RankBoost) gave a lower MAP score than GBM with a Bernoulli distribution.
The idea of classifying features based on the bipartite author-paper graph is very promising. Analyzing the graph topology can give ideas for new features.

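For readers unfamiliar with the metric, mean average precision (MAP) can be computed as sketched below. This follows the standard definition and assumes one ranked list of binary labels per author (1 for a confirmed paper, 0 otherwise). The competition's exact handling of ties and duplicates is not described in these slides, and the toy rankings are hypothetical.

```python
def average_precision(ranked_labels):
    """Average of precision@k over the positions k that hold a relevant item."""
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-list average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical authors, papers sorted by predicted score (best first).
rankings = [
    [1, 0, 1],  # AP = (1/1 + 2/3) / 2
    [0, 1],     # AP = (1/2) / 1
]
map_score = mean_average_precision(rankings)
```

Because the metric depends only on the ordering within each author's list, any model that produces a good per-pair score, such as a classifier trained with a Bernoulli loss, can be evaluated with it directly.
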
Thank you!
