This paper addresses the problem of name disambiguation in digital libraries that administer bibliographic citations. The problem arises when multiple authors share a common name or when multiple name variations of one author appear in citation records. Name disambiguation is not trivial to solve, and most digital libraries do not provide an efficient way to accurately identify the citation records of an author. Furthermore, the lack of complete metadata in digital libraries hinders the development of a generic algorithm applicable to any dataset. We propose a heuristic-based, unsupervised, and adaptive method that also incorporates user interaction, taking users' feedback into account during disambiguation. The method exploits important features associated with an author and citation records, such as co-authors, affiliation, publication title, and venue, in a multilayer hierarchical clustering algorithm that tunes itself according to the available information and forms clusters of unambiguous records. Our experiments on a set of researchers considered highly ambiguous produced high precision and recall, supporting the viability of our algorithm.
A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries
1. A Real-time Heuristic-based Name Disambiguation Method for Digital Libraries
Muhammad Imran, Syed Zeeshan Haider Gillani, Maurizio Marchese
2. Outline
• Name Disambiguation problem
• Mixed and Split Citations
• Related work
• Our approach
• Experiments & results
• Conclusion
3. Name Disambiguation
• Multiple authors share the same name: "Muhammad Imran" may refer to Author-1, Author-2, Author-3, or Author-4
• One author with multiple name variations: Muhammad Imran, M. Imran, Imran Muhammad
4. Name Disambiguation Types
• Mixed citations: under the name "M. Imran", the digital library (DL) returns the citation records of several authors (Muhammad Imran, Malik Imran, Mehar Imran) mixed together
5. Name Disambiguation Types
• Split citations: the citation records of one author (Muhammad Imran) are split in the DL across several author entries (Author-1, Author-2, Author-3)
6. Related Work
• Supervised approaches
  ◦ Generative (naïve Bayes)
  ◦ Discriminative (support vector machines)
  ◦ Labor-intensive, high training cost
• Unsupervised approaches
  ◦ Mostly fail to handle name variations
  ◦ No user intervention
7. Our Contributions
• An end-to-end system
  ◦ Retrieval -> pre-processing -> disambiguation
• A generic disambiguation approach
  ◦ Unsupervised
  ◦ Heuristics based
  ◦ Involves users' feedback
8. Our Approach
[Figure: four-layer disambiguation pipeline applied to citation records (CR) containing both mixed and split citations]
• Layer-1: Discipline based clustering; the user selects the principal cluster, a subset of the citation records
• Layer-2: Co-author based split & building the list of candidate principal authors
• Layer-3: Affiliation & candidate authors based merge
• Layer-4: Title & homepage based merge, using a title based vector
9. Hierarchical Clustering & Feature Representation
• Approaches
  ◦ Agglomerative
  ◦ Divisive
• Feature matrix X (N x D)
  ◦ N (rows) = number of citation records
  ◦ D (columns) = number of features
  ◦ Xi,j = jth feature of the ith citation record
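A minimal sketch of the N x D feature representation. The slides do not specify how records or features are encoded, so the field names below (`coauthors`, `venue`, `affiliation`, `title`) and the placeholder values are assumptions made for illustration only.

```python
# Hypothetical feature names; the slides do not define the record schema.
FEATURES = ["coauthors", "venue", "affiliation", "title"]


def build_feature_matrix(records):
    """Return an N x D matrix X where X[i][j] is the j-th feature of the
    i-th citation record (None when the record lacks that feature)."""
    return [[rec.get(name) for name in FEATURES] for rec in records]


# Placeholder records; note that the second one has no affiliation or title.
records = [
    {"coauthors": ["author-2"], "venue": "venue-1",
     "affiliation": "institute-1", "title": "title-1"},
    {"coauthors": ["author-3"], "venue": "venue-2"},
]

X = build_feature_matrix(records)
n_records, n_features = len(X), len(X[0])  # N rows, D columns
```

Missing features are kept as `None` rather than dropped, which mirrors the point that the method has to adapt to whatever metadata a digital library actually provides.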
10. Features: co-authorship
• Joint authors of a book, article, …
• Available across DLs
• We use it as:
  ◦ Principal author
  ◦ Co-authors
• Example: the citation record {author-1, author-2, author-3, author-4, author-5} is divided into the principal author and the co-authors
11. Features: co-authorship
• Heuristic
"If a co-author appears in two different publications with the same principal author, then most likely both publications belong to that principal author."
• Example: IF citation record-1 {author-1, author-2, ...} and citation record-2 {author-1, author-2, ...} have the same principal author name (author-1) and the same co-author (author-2), THEN both records belong to principal author-1
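The co-author heuristic can be sketched as a grouping step: records that carry the same principal author name are placed in one cluster whenever they share at least one co-author, directly or through a chain of other records. This is an illustrative sketch, not the authors' implementation; the record layout (`id`, `coauthors`) and the function name are assumptions.

```python
def split_by_coauthors(records):
    """Group records (all with the same principal author name) so that
    records sharing a co-author end up in the same cluster."""
    clusters = []  # list of (co-author set, record list) pairs
    for rec in records:
        coauthors = set(rec["coauthors"])
        merged_set, merged_recs = set(coauthors), [rec]
        remaining = []
        for c_set, c_recs in clusters:
            if c_set & coauthors:
                # Shared co-author: fold this cluster into the new one.
                merged_set |= c_set
                merged_recs = c_recs + merged_recs
            else:
                remaining.append((c_set, c_recs))
        clusters = remaining + [(merged_set, merged_recs)]
    return [recs for _, recs in clusters]


record_1 = {"id": "record-1", "coauthors": ["author-2", "author-3"]}
record_2 = {"id": "record-2", "coauthors": ["author-2", "author-4"]}
record_3 = {"id": "record-3", "coauthors": ["author-5"]}

clusters = split_by_coauthors([record_1, record_2, record_3])
```

Here record-1 and record-2 share author-2 and fall into one cluster, while record-3 has no co-author in common and stays separate.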
12. Features: Conference Venue
• A venue is the name of an event or outlet, e.g., a conference, a workshop, or a journal.
• Available across DLs.
• Heuristic
"The venue information of two researchers with the same name can differentiate one from the other by examining the disciplines and sub-disciplines of each researcher's interests."
13. Features: Author's Affiliation
• An author's affiliation with an institute, university, organization, etc.
• Available across DLs.
• Heuristic
"If two publications with the same principal author name also share the same affiliation information, then both publications are considered to belong to the same author."
14. Features: Authors Names
• An author's name can have multiple name variations.
• For example: Muhammad Imran
  ◦ M. Imran
  ◦ Imran Muhammad
  ◦ Muhammad I.
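A small sketch of how the listed variations could be generated and compared. It assumes names of exactly two tokens in "Given Family" order, which is a simplification; the slides do not describe the actual matching rules, and the function names are hypothetical.

```python
def name_variations(full_name):
    """Generate simple variations of a two-token 'Given Family' name."""
    given, family = full_name.split()
    return {
        f"{given} {family}",       # Muhammad Imran
        f"{given[0]}. {family}",   # M. Imran
        f"{family} {given}",       # Imran Muhammad
        f"{given} {family[0]}.",   # Muhammad I.
    }


def may_be_same_name(name_a, name_b):
    """True if either name is a generated variation of the other."""
    return name_b in name_variations(name_a) or name_a in name_variations(name_b)


variations = name_variations("Muhammad Imran")
```

A match on name alone is only a candidate: "M. Imran" is also a variation of "Malik Imran" and "Mehar Imran", which is exactly the mixed-citation case that the other features have to resolve.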
15. Features: Publication titles
• Title as a string literal
• We maintain a vector of important keywords
• Represents the author's interests
• A similarity measure between a given citation record and the vector can be useful
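One way to realise the keyword vector and similarity measure is a bag-of-words count compared by cosine similarity. The slides do not name a specific measure, so cosine similarity, the stop-word list, and the function names here are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

# Assumed stop-word list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def keywords(title):
    """Lower-case a title and keep its alphabetic, non-stop-word tokens."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS]


def build_keyword_vector(titles):
    """Accumulate keyword counts over the titles already assigned to an author."""
    vector = Counter()
    for title in titles:
        vector.update(keywords(title))
    return vector


def title_similarity(vector, title):
    """Cosine similarity between the keyword vector and one candidate title."""
    other = Counter(keywords(title))
    dot = sum(vector[w] * other[w] for w in other)
    norm = (math.sqrt(sum(v * v for v in vector.values()))
            * math.sqrt(sum(v * v for v in other.values())))
    return dot / norm if norm else 0.0


vector = build_keyword_vector([
    "A Real-time Heuristic-based Unsupervised Method for Name "
    "Disambiguation in Digital Libraries",
])
related = title_similarity(vector, "Name disambiguation in digital libraries")
unrelated = title_similarity(vector, "Protein folding dynamics")  # made-up title
```

A candidate record whose title scores above some threshold against the vector would be merged into the author's cluster; the threshold itself is not given in the slides.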
17. Disambiguation System in Action
• Inter-related disciplines based formation of clusters
• Co-authors based split
• Affiliation based agglomerative merge
• Pursuit of the remaining bits
18. Inter-related disciplines based formation of clusters
• Exploits venue/discipline information
• Forms relatively big clusters
• Involves users and considers their selection among clusters
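The first layer can be sketched as grouping records by the discipline of their venue and then letting the user pick the cluster to continue with. The venue-to-discipline lookup table below is a hypothetical stand-in; the slides do not say where the discipline information comes from, and the function names are illustrative.

```python
from collections import defaultdict

# Hypothetical lookup table; in practice this would come from a curated source.
VENUE_DISCIPLINE = {
    "JCDL": "digital libraries",
    "TPDL": "digital libraries",
    "ICSE": "software engineering",
}


def cluster_by_discipline(records):
    """Group citation records by the discipline of their venue."""
    clusters = defaultdict(list)
    for rec in records:
        discipline = VENUE_DISCIPLINE.get(rec["venue"], "unknown")
        clusters[discipline].append(rec)
    return dict(clusters)


def select_principal_cluster(clusters, user_choice):
    """Return the cluster the user selected as the principal cluster."""
    return clusters[user_choice]


records = [
    {"id": "record-1", "venue": "JCDL"},
    {"id": "record-2", "venue": "TPDL"},
    {"id": "record-3", "venue": "ICSE"},
    {"id": "record-4", "venue": "venue-x"},  # venue missing from the lookup
]

clusters = cluster_by_discipline(records)
principal = select_principal_cluster(clusters, "digital libraries")
```

Because whole disciplines are grouped together, the clusters at this stage are coarse; the later layers refine the user-selected cluster.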
21. Experiment & Evaluation
Dataset
• 50 most ambiguous researchers
• Manually annotated a golden dataset
• Used DBLP as the data source
• Used ADANA as the baseline approach
• Used Precision, Recall and F1 as performance measures
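The three measures can be computed per researcher by comparing the set of records the method assigns to that researcher with the manually annotated set. The slides do not state whether evaluation was done per cluster or pairwise, so the set-based comparison below is an assumption, and the record identifiers are placeholders rather than results from the experiments.

```python
def precision_recall_f1(predicted, gold):
    """Compare predicted record ids against the gold-standard record ids."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder ids: 3 of 4 predicted records are correct, 5 records are in the gold set.
predicted = {"cr1", "cr2", "cr3", "cr4"}
gold = {"cr1", "cr2", "cr3", "cr5", "cr6"}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

In this toy case precision is 3/4 and recall is 3/5; F1 is their harmonic mean.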