A Real-time Heuristic-based Name Disambiguation Method for Digital Libraries
Muhammad Imran, Syed Zeeshan Haider Gillani, Maurizio Marchese
Outline
• Name Disambiguation problem
• Mixed and Split Citations
• Related work
• Our approach
• Experiments & results
• Conclusion
Name Disambiguation
• Multiple authors share the same name: "Muhammad Imran" may refer to Author-1, Author-2, Author-3 or Author-4
• One author with multiple name variations: Muhammad Imran, M. Imran, Imran Muhammad
Name Disambiguation Types: Mixed citations
• Citation records of different authors (Muhammad Imran, Malik Imran, Mehar Imran) appear mixed under one name (M. Imran) in the digital library (DL)
Name Disambiguation Types: Split citations
• Citation records of one author (Muhammad Imran) are split across several author entries (Author-1, Author-2, Author-3) in the DL
Related Work
• Supervised approaches
  - Generative (naïve Bayes)
  - Discriminative (support vector machines)
  - Labor-intensive, high training cost
• Unsupervised approaches
  - Mostly fail to tackle the name variation issue
  - No user intervention
Our Contributions
• An end-to-end system
  - Retrieval -> pre-processing -> disambiguation
• A generic disambiguation approach
  - Unsupervised
  - Heuristics based
  - Involves users' feedback
Our Approach
[Figure: four-layer pipeline (Layer-1 to Layer-4). Input: citation records (CR) containing both mixed and split citations. cp = candidate principal author, pa = principal author.]
• Discipline based clustering; cluster selection yields a subset of citation records
• Co-author based split & building the candidate principal authors' list
• Affiliation & candidate authors based merge
• Title & homepage based merge, using a title based vector
• Principal cluster selection: the user selects the principal cluster and the principal author from the list of candidate principal authors
Hierarchical Clustering & Feature Representation
• Approaches
  - Agglomerative
  - Divisive
• Feature matrix X (N x D)
  - N (rows) = No. of citation records
  - D (columns) = No. of features
  - X_{i,j} = j-th feature of the i-th citation record
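The N x D layout above can be sketched in a few lines of Python. This is only an illustration: the record fields and feature names (`n_coauthors`, `has_affiliation`, `has_homepage`) are hypothetical and are not taken from the slides.

```python
# Hypothetical citation records; the field names are assumptions for illustration.
records = [
    {"n_coauthors": 2, "has_affiliation": 1, "has_homepage": 0},
    {"n_coauthors": 4, "has_affiliation": 0, "has_homepage": 1},
    {"n_coauthors": 1, "has_affiliation": 1, "has_homepage": 1},
]
FEATURES = ["n_coauthors", "has_affiliation", "has_homepage"]


def build_feature_matrix(records, features):
    """Return an N x D list of lists: one row per record, one column per feature."""
    return [[rec[name] for name in features] for rec in records]


X = build_feature_matrix(records, FEATURES)
n_records, n_features = len(X), len(X[0])  # N and D
# X[i][j] is the j-th feature of the i-th citation record.
```

With this layout, row `X[i]` is the full feature vector of record `i`, which is the form most clustering routines expect.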
Features: co-authorship
• Joint authors of a book, article, ...
• Available across DLs
• We use it as:
  - Principal author
  - Co-authors
Example citation record: {author-1, author-2, author-3, author-4, author-5}, where author-1 is the principal author and the others are co-authors
Features: co-authorship
• Heuristic
"If a co-author appears in two different publications with the same principal author, then most likely both publications belong to that principal author."
IF citation record-1 {author-1, author-2, ...} and citation record-2 {author-1, author-2, ...} have the same principal author-1 and the same co-author author-2,
THEN both records belong to principal author-1
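The IF/THEN rule above can be sketched as a grouping step. This is a minimal illustration, assuming each record is a `(principal_author, set_of_coauthors)` pair and using a union-find structure to link records transitively; the function name and data layout are hypothetical, not the authors' implementation.

```python
def group_by_shared_coauthor(records):
    """Group record indices that share a principal-author name and a co-author.

    records: list of (principal_author, set_of_coauthors) tuples.
    Returns a sorted list of index groups.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            same_principal = records[i][0] == records[j][0]
            shared_coauthor = records[i][1] & records[j][1]
            if same_principal and shared_coauthor:
                union(i, j)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


records = [
    ("author-1", {"author-2", "author-3"}),
    ("author-1", {"author-2"}),
    ("author-1", {"author-4"}),
    ("author-1", {"author-4", "author-5"}),
]
groups = group_by_shared_coauthor(records)
```

In the example, records 0 and 1 share author-2 and records 2 and 3 share author-4, so the same principal-author name is split into two groups.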
Features: Conference Venue
• Venue represents an event name, e.g., a conference, workshop or journal name.
• Available across DLs.
• Heuristic
"The venue information of two researchers who have the same name can differentiate one from the other by examining the disciplines and sub-disciplines of each researcher's interest."
Features: Author's Affiliation
• Author's affiliation with an institute, university, organization, etc.
• Available across DLs.
• Heuristic
"If two publications with the same principal author name also share the same affiliation information, then both publications are considered to belong to the same author."
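The affiliation heuristic can be sketched as a merge over clusters that already share a principal-author name. This is an assumption-laden illustration: the cluster layout, the `normalize` helper and the example affiliations are hypothetical, and real affiliation strings would need far more careful matching.

```python
def normalize(affiliation):
    """Lowercase and collapse punctuation/whitespace (a deliberately crude normalizer)."""
    return " ".join(affiliation.lower().replace(",", " ").split())


def merge_by_affiliation(clusters):
    """Merge clusters that share at least one normalized affiliation.

    clusters: list of {"records": [...], "affiliations": set_of_strings}.
    """
    merged = []
    for cluster in clusters:
        affs = {normalize(a) for a in cluster["affiliations"]}
        recs = list(cluster["records"])
        remaining = []
        for m in merged:
            if m["affiliations"] & affs:
                # Overlapping affiliation: absorb the earlier cluster.
                affs |= m["affiliations"]
                recs = m["records"] + recs
            else:
                remaining.append(m)
        remaining.append({"records": recs, "affiliations": affs})
        merged = remaining
    return merged


clusters = [
    {"records": [1, 2], "affiliations": {"Example University"}},
    {"records": [3], "affiliations": {"example  university"}},
    {"records": [4], "affiliations": {"Example Institute"}},
]
result = merge_by_affiliation(clusters)
merged_records = [c["records"] for c in result]
```

Here the first two clusters collapse into one because their affiliations normalize to the same string, while the third stays separate.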
Features: Authors' Names
• An author's name can have multiple name variations.
• For example: Muhammad Imran
  - M. Imran
  - Imran Muhammad
  - Muhammad I.
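One way to test whether two strings could be variations of the same name is sketched below. The matching rule (same number of name parts, order ignored, an initial matches any part starting with that letter) is an assumption made for illustration; the slides do not specify how variations are detected.

```python
def tokens(name):
    """Split a name into lowercase parts, dropping dots and commas."""
    parts = name.replace(",", " ").split()
    return [p.strip(".").lower() for p in parts if p.strip(".")]


def compatible(x, y):
    """An initial matches any part with the same first letter; full parts must be equal."""
    if len(x) == 1 or len(y) == 1:
        return x[0] == y[0]
    return x == y


def is_variation(a, b):
    """Greedy, order-insensitive pairing of name parts."""
    ta, tb = tokens(a), tokens(b)
    if len(ta) != len(tb):
        return False
    unused = list(tb)
    for part in ta:
        match = next((u for u in unused if compatible(part, u)), None)
        if match is None:
            return False
        unused.remove(match)
    return True


full = "Muhammad Imran"
results = [is_variation(full, v) for v in ("M. Imran", "Imran Muhammad", "Muhammad. I")]
```

Note that under this rule "M. Imran" is also compatible with "Malik Imran", which is exactly the mixed-citation ambiguity the other features are meant to resolve.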
Features: Publication titles
• Title as a string literal
• We maintain a vector of important keywords
  - Represents the author's interests
• A similarity measure between a given citation record and the vector can be useful
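A keyword vector and a similarity measure of this kind might look as follows. The sketch assumes a bag-of-words profile with cosine similarity and a small hand-picked stopword list; the slides do not say which keyword extraction or similarity measure is actually used.

```python
import math
from collections import Counter

# A tiny illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "with", "to"}


def keywords(title):
    """Lowercase the title, strip punctuation, and drop stopwords."""
    words = [w.strip(".,:;!?()").lower() for w in title.split()]
    return [w for w in words if w and w not in STOPWORDS]


def build_profile(titles):
    """Keyword-count vector over all titles already attributed to the author."""
    profile = Counter()
    for title in titles:
        profile.update(keywords(title))
    return profile


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


profile = build_profile([
    "Name Disambiguation in Digital Libraries",
    "Clustering Citation Records in Digital Libraries",
])
related = cosine(profile, Counter(keywords("Heuristic Name Disambiguation for Digital Libraries")))
unrelated = cosine(profile, Counter(keywords("Protein Folding Simulation")))
```

A record whose title scores high against the profile is a candidate for merging into the author's cluster; the threshold would have to be tuned on real data.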
Features: Principal Author's Homepage
• The URL of the principal author's homepage.
Disambiguation System in Action
• Inter-related disciplines based formation of clusters
• Co-authors based split
• Affiliation based agglomerative merge
• Pursuit of the remaining bits
Inter-related disciplines based formation of clusters
• Exploits venue/discipline information
• Forms relatively big clusters
• Involves users and considers their selection among clusters
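The first layer can be sketched as a lookup from venue to discipline, followed by grouping records whose disciplines are inter-related. Both lookup tables below are hypothetical placeholders; the slides do not describe how venues are mapped to disciplines or which disciplines count as related.

```python
# Hypothetical lookup tables (assumptions for illustration only).
VENUE_TO_DISCIPLINE = {
    "JCDL": "digital libraries",
    "SIGIR": "information retrieval",
    "ICRA": "robotics",
}
RELATED_DISCIPLINES = [
    {"digital libraries", "information retrieval"},
    {"robotics"},
]


def discipline_group(venue):
    """Return the index of the inter-related discipline group, or None if unknown."""
    discipline = VENUE_TO_DISCIPLINE.get(venue)
    for index, group in enumerate(RELATED_DISCIPLINES):
        if discipline in group:
            return index
    return None


def cluster_by_discipline(records):
    """records: list of (record_id, venue). Returns {group_index: [record_id, ...]}."""
    clusters = {}
    for record_id, venue in records:
        clusters.setdefault(discipline_group(venue), []).append(record_id)
    return clusters


records = [("r1", "JCDL"), ("r2", "SIGIR"), ("r3", "ICRA"), ("r4", "JCDL")]
clusters = cluster_by_discipline(records)
# The user then picks the cluster that matches the author of interest.
selected = clusters[0]
```

The resulting clusters are deliberately coarse; later layers split and merge them further.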
Co-author Based Split
• Using k-means clustering
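A co-author based split with k-means could be sketched as below. The slides only name k-means; the binary co-author encoding, the choice of k, and the deterministic initialization (the first k points) are all assumptions made to keep the example small and reproducible.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iterations=20):
    """Plain Lloyd's algorithm; returns one cluster index per point."""
    centroids = [list(p) for p in points[:k]]  # deterministic init: first k points
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Binary co-author vectors over the vocabulary [author-2, author-3, author-4, author-5].
vocabulary = ["author-2", "author-3", "author-4", "author-5"]
coauthor_sets = [
    {"author-2", "author-3"},
    {"author-4", "author-5"},
    {"author-2"},
    {"author-5"},
]
points = [[1 if name in s else 0 for name in vocabulary] for s in coauthor_sets]
labels = kmeans(points, k=2)
```

Records 0 and 2 (sharing author-2) fall into one cluster and records 1 and 3 (sharing author-5) into the other.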
Experiment & Evaluation
Dataset
• 50 most ambiguous researchers
• Manually annotated a golden dataset
• Used DBLP as a data source
• Used ADANA as a baseline approach
• Used Precision, Recall and F1 as performance measures
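The three measures can be computed for one author as sketched below, comparing the set of records the system assigns to the author against the manually annotated gold set. The set-based formulation is an assumption; the slides do not state whether the evaluation is per record or pairwise.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall and F1 for one author's records."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical record ids: the system returned 4 records, 3 of which are correct,
# while the gold annotation lists 5 records for this author.
precision, recall, f1 = precision_recall_f1({1, 2, 3, 4}, {1, 2, 3, 5, 6})
```

For this example the precision is 3/4, the recall is 3/5, and F1 is their harmonic mean.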
Experiment & Evaluation
Thank you! 
Muhammad Imran 
mimran@qf.org.qa
