KDD Cup 2013 Track 1 - Author - Paper Identification challenge (2nd place team)

Aug 17, 2013

We describe our approach to the Author-Paper Identification Challenge: https://www.kaggle.com/c/kdd-cup-2013-author-paper-identification-challenge

KDD Cup 2013 Author – Paper Identification Challenge (2nd place team)
Dmitry Efimov, Lucas Silva, Benjamin Solecki

Approach summary
• Goal: find incorrectly assigned author-paper pairs
• Supervised machine learning problem with a binary response
• Deep feature engineering (> 300 features)
• Gradient Boosting Machine (package gbm in R)

Author – Paper graph
[Figure: author-paper graph]

Author features
• Count features: count of journals
• NLP features: tf-idf measure
• Multiple source features: author’s duplicates

Paper features
• Count features: count of keywords
• NLP features: tf-idf measure
• Multiple source features: paper’s duplicates
• Additional features: reverse feature engineering
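
The tf-idf measure listed under NLP features can be sketched as follows. This is a minimal illustration in Python (the team worked in R), assuming each paper is represented by a list of keyword tokens. The function name `tf_idf` and the toy keyword lists are hypothetical, and the unsmoothed weighting tf × log(N / df) is one common variant, not necessarily the one the authors used.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy keyword lists for three hypothetical papers (illustrative data only).
papers = [
    ["boosting", "ranking", "features"],
    ["boosting", "graph"],
    ["graph", "topology", "features"],
]
weights = tf_idf(papers)
```

A keyword that occurs in a single paper ("ranking") receives a larger weight than one shared by several papers ("boosting"), which is what makes the measure useful for comparing an author's vocabulary with a paper's.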

Author – paper features (1 of 4)
• Count features
• Multiple source features
• Additional features
• Likelihood features

Author – paper features (2 of 4)
• Count features: count of coauthors with the same affiliation
• Additional features: reverse feature engineering (year ranking feature)
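
One possible reading of the "count of coauthors with the same affiliation" feature is sketched below in Python. The data layout (`paper_authors` mapping a paper to its author ids, `affiliation` mapping an author to a string) and the function name are assumptions made for illustration; the slides do not specify how the feature was computed.

```python
def same_affiliation_coauthors(author_id, paper_id, paper_authors, affiliation):
    """Count coauthors on one paper whose affiliation equals the author's."""
    own = affiliation.get(author_id)
    if not own:
        # Unknown or empty affiliation: nothing to match against.
        return 0
    return sum(
        1
        for coauthor in paper_authors.get(paper_id, [])
        if coauthor != author_id and affiliation.get(coauthor) == own
    )

# Hypothetical toy data: paper 10 has three authors.
paper_authors = {10: [1, 2, 3]}
affiliation = {1: "University A", 2: "University A", 3: "University B"}

feature_author_1 = same_affiliation_coauthors(1, 10, paper_authors, affiliation)
feature_author_3 = same_affiliation_coauthors(3, 10, paper_authors, affiliation)
```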

Author – paper features (3 of 4)
• Multiple source features: how many times did the author-paper pair appear in the Microsoft database?
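
Counting how often a pair appears can be sketched as a simple tally. The Python snippet below assumes the database is available as a list of (author id, paper id) rows in which the same pair may be listed more than once; the rows and the helper name `pair_count` are hypothetical.

```python
from collections import Counter

# Hypothetical rows: the pair (author 1, paper 10) is listed three times.
rows = [(1, 10), (1, 10), (2, 10), (1, 11), (1, 10)]

# Tally every (author_id, paper_id) pair once over the whole table.
pair_counts = Counter(rows)

def pair_count(author_id, paper_id):
    """Number of times the pair occurs; 0 if it never appears."""
    return pair_counts[(author_id, paper_id)]
```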

Author – paper features (4 of 4)
Likelihood features
Lj – likelihood by journal
Lja – likelihood by journal and author
• Use Lj and Lja as features.
• Shrunken likelihood:
1) use (α∙Lj + (1−α)∙Lja) as a feature;
2) use mixed-effects models (package lme4 in R) to find α.
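
The shrunken likelihood can be sketched in Python as below. The sketch assumes that Lj is the share of confirmed pairs among the training pairs of a journal and that Lja is the same share restricted to one author within that journal; the slides do not spell out these definitions. The value α = 0.3 is an arbitrary placeholder, since the mixed-effects estimation of α with lme4 is not reproduced here, and all names are hypothetical.

```python
from collections import defaultdict

def likelihoods(training_rows):
    """training_rows: iterable of (author_id, journal_id, confirmed), confirmed in {0, 1}."""
    by_journal = defaultdict(lambda: [0, 0])         # journal -> [confirmed, total]
    by_journal_author = defaultdict(lambda: [0, 0])  # (journal, author) -> [confirmed, total]
    for author_id, journal_id, confirmed in training_rows:
        for tally in (by_journal[journal_id],
                      by_journal_author[(journal_id, author_id)]):
            tally[0] += confirmed
            tally[1] += 1
    l_j = {key: c / n for key, (c, n) in by_journal.items()}
    l_ja = {key: c / n for key, (c, n) in by_journal_author.items()}
    return l_j, l_ja

def shrunken_likelihood(l_j, l_ja, alpha):
    """Blend the two likelihoods as alpha * Lj + (1 - alpha) * Lja."""
    return alpha * l_j + (1 - alpha) * l_ja

# Hypothetical training rows for a single journal "J1".
rows = [(1, "J1", 1), (1, "J1", 0), (2, "J1", 1), (2, "J1", 1)]
l_j, l_ja = likelihoods(rows)

alpha = 0.3  # placeholder value, not estimated from data
feature = shrunken_likelihood(l_j["J1"], l_ja[("J1", 1)], alpha)
```

The blend pulls the noisy per-author estimate Lja towards the more stable per-journal estimate Lj, which matters most for authors with only a few papers in a journal.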

Model
• Gradient Boosting Machine (package gbm in R)
• Grid search over the set of parameters
• 83 features in the final model (out of 300 calculated features)
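
A grid search of this kind can be sketched generically in Python. The parameter names, the grid values and the scoring function `toy_score` below are placeholders chosen only so that the example runs; in the real setting the score would come from training a GBM with each parameter set and measuring MAP on validation data, and the grid the team actually searched is not given in the slides.

```python
from itertools import product

def grid_search(grid, score_fn):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Illustrative grid; these values are assumptions, not the team's settings.
grid = {
    "n_trees": [500, 1000],
    "interaction_depth": [3, 5],
    "shrinkage": [0.01, 0.05],
}

def toy_score(params):
    # Stand-in for "train a model, return validation MAP".
    # It simply peaks at n_trees=1000, interaction_depth=5, shrinkage=0.05.
    return (-abs(params["n_trees"] - 1000) / 1000
            - abs(params["interaction_depth"] - 5)
            - abs(params["shrinkage"] - 0.05))

best_params, best_score = grid_search(grid, toy_score)
```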

Result and conclusion
• Our MAP score is 0.98144 (the winning submission scored 0.98259).
• Many algorithms based on MAP optimization (LambdaRank, LambdaMART, RankBoost) gave a lower MAP score than GBM with a Bernoulli distribution.
• The idea of classifying features based on the bipartite author-paper graph is very promising. Analyzing the graph topology can give ideas for new features.
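
For reference, the MAP (mean average precision) metric quoted above can be computed as in the following Python sketch. It assumes that for each author the submission is a ranked list of paper ids and the ground truth is the set of confirmed papers; the function names and toy data are hypothetical, and the competition's official scorer may treat edge cases such as duplicate ids differently.

```python
def average_precision(ranked, relevant):
    """Average of precision@k over the ranks k at which a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(rankings, relevant_sets):
    """Mean of the per-author average precision values."""
    scores = [average_precision(rankings[a], relevant_sets[a]) for a in rankings]
    return sum(scores) / len(scores)

# Toy example with two hypothetical authors.
rankings = {"a1": [10, 11, 12], "a2": [20, 21]}
relevant_sets = {"a1": {10, 12}, "a2": {21}}
map_score = mean_average_precision(rankings, relevant_sets)
```

Because only the ordering of each author's papers matters, a classifier that outputs well-ordered probabilities can score well on MAP without optimizing it directly.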
