
A Real-time Heuristic-based Name Disambiguation Method for Digital Libraries
Muhammad Imran, Syed Zeeshan Haider Gillani, Maurizio Marchese

Outline
• Name Disambiguation problem
• Mixed and Split Citations
• Related work
• Our approach
• Experiments & results
• Conclusion

Name Disambiguation
Multiple authors share the same name:
  Muhammad Imran -> Author-1, Author-2, Author-3, Author-4
One author with multiple name variations:
  Name variation-1: Muhammad Imran
  Name variation-2: M. Imran
  Name variation-3: Imran Muhammad

Name Disambiguation Types
Mixed citations: the DL lists the citation records of several authors (Muhammad Imran, Malik Imran, Mehar Imran) mixed together under one name, M. Imran.

Name Disambiguation Types
Split citations: the DL splits the citation records of one author, Muhammad Imran, across several author entries (Author-1, Author-2, Author-3).

Related Work
• Supervised approaches
  - Generative (naïve Bayes)
  - Discriminative (support vector machines)
  - Labor-intensive, high training cost
• Unsupervised approaches
  - Mostly failed to tackle the name variation issue
  - No user intervention

Our Contributions
• An end-to-end system
  - Retrieval -> pre-processing -> disambiguation
• A generic disambiguation approach
  - Unsupervised
  - Heuristics-based
  - Involves users' feedback

Our Approach
[Figure: the four-layer disambiguation pipeline (Layer-1 to Layer-4). CR = citation record, cp = candidate principal author, pa = principal author.]
Input: citation records containing both mixed and split citations.
• Discipline-based clustering; the user selects the principal cluster, which gives a subset of citation records
• Co-author-based split and building of the candidate principal authors' list
• Affiliation and candidate-authors-based merge
• Title and homepage-based merge, using a title-based vector
Output: the citation records of the principal author.

Hierarchical Clustering & Feature Representation
• Approaches
  - Agglomerative
  - Divisive
• Feature matrix X (N x D)
  - N (rows) = number of citation records
  - D (columns) = number of features
  - Xi,j = jth feature of the ith citation record
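As an illustration of the N x D layout, here is a minimal sketch that stores one row per citation record and one column per feature. The feature names in `FEATURES`, the helper `build_feature_matrix` and the sample records are assumptions made for this example only; the slides do not say how the matrix is encoded.

```python
# Hypothetical feature names (assumption): one column per feature used in the talk.
FEATURES = ["coauthors", "venue", "affiliation", "title", "homepage"]


def build_feature_matrix(records):
    """Return an N x D list of lists: X[i][j] is the j-th feature of record i.

    Missing features are stored as None.
    """
    return [[record.get(name) for name in FEATURES] for record in records]


# Made-up placeholder records, only to show the shape of the matrix.
records = [
    {"coauthors": {"author-2", "author-3"}, "venue": "venue-A", "title": "title-1"},
    {"coauthors": {"author-2"}, "venue": "venue-B",
     "affiliation": "institute-X", "title": "title-2"},
]

X = build_feature_matrix(records)
```

Here `X` has N = 2 rows and D = 5 columns; a real system would still need to turn each cell into something comparable (sets, strings or numeric vectors).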

Features: Co-authorship
• Joint authors of a book, article, ...
• Available across DLs
• We use it as:
  - Principal author
  - Co-authors
Example citation record: {author-1, author-2, author-3, author-4, author-5}, where author-1 is the principal author and author-2 to author-5 are the co-authors.

Features: Co-authorship
• Heuristics
"If a co-author appears in two different publications with the same principal author, then most likely both publications belong to the principal author."
IF citation record-1 {author-1, author-2, ...} and citation record-2 {author-1, author-2, ...} have the same principal author (author-1) and share a co-author (author-2), THEN both records belong to principal author-1.
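The IF/THEN rule can be sketched as a small predicate. This is only an illustration under assumptions: each citation record is taken to be a list of author-name strings, names are compared by exact match, and the function name `share_coauthor` is hypothetical.

```python
def share_coauthor(record_1, record_2, principal):
    """Return True if both records list `principal` and share at least one co-author.

    Each record is a list of author names (assumption); comparison is exact.
    """
    if principal not in record_1 or principal not in record_2:
        return False
    coauthors_1 = set(record_1) - {principal}
    coauthors_2 = set(record_2) - {principal}
    # A non-empty intersection means some co-author appears in both records.
    return bool(coauthors_1 & coauthors_2)


record_1 = ["author-1", "author-2", "author-3"]
record_2 = ["author-1", "author-2", "author-4"]
record_3 = ["author-1", "author-5"]

linked = share_coauthor(record_1, record_2, "author-1")
not_linked = share_coauthor(record_1, record_3, "author-1")
```

In this sketch `linked` is True (author-2 appears in both records) and `not_linked` is False. The rule only gives positive evidence: the absence of a shared co-author does not show that two records belong to different people.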

Features: Conference Venue
• A venue is the name of an event or outlet, e.g., a conference, a workshop or a journal.
• Available across DLs.
• Heuristics
"The venue information of two researchers with the same name can differentiate one from the other by examining the disciplines and sub-disciplines of each researcher's interests."
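A rough sketch of how venue information could drive discipline-based grouping. The venue-to-discipline table, the record fields `id` and `venue`, and the function `cluster_by_discipline` are all hypothetical; the slides do not say where the discipline information comes from.

```python
from collections import defaultdict

# Hypothetical lookup table (assumption): venue name -> discipline.
VENUE_TO_DISCIPLINE = {
    "venue-db-1": "databases",
    "venue-db-2": "databases",
    "venue-bio-1": "biology",
}


def cluster_by_discipline(records):
    """Group record ids by the discipline of their venue.

    Venues missing from the table fall into an 'unknown' cluster.
    """
    clusters = defaultdict(list)
    for record in records:
        discipline = VENUE_TO_DISCIPLINE.get(record["venue"], "unknown")
        clusters[discipline].append(record["id"])
    return dict(clusters)


# Made-up records that all carry the same ambiguous author name.
records = [
    {"id": 1, "venue": "venue-db-1"},
    {"id": 2, "venue": "venue-bio-1"},
    {"id": 3, "venue": "venue-db-2"},
    {"id": 4, "venue": "venue-x"},
]

clusters = cluster_by_discipline(records)
```

Records 1 and 3 land in the same cluster, while record 2 is separated from them; a user could then pick the cluster that matches the author they are looking for.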

Features: Author's Affiliation
• The author's affiliation with an institute, university, organization, etc.
• Available across DLs.
• Heuristics
"If two publications with the same principal author name also share the same affiliation information, then both publications are considered to belong to the same author."
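The affiliation rule can be sketched as follows. The record fields `principal` and `affiliation`, the light string normalization and the function names are assumptions for illustration; real affiliation strings usually need much more cleaning than this.

```python
def normalize_affiliation(text):
    """Lowercase, drop commas and periods, and collapse whitespace."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return " ".join(cleaned.split())


def same_author_by_affiliation(record_1, record_2):
    """Return True if both records have the same principal author name
    and the same (normalized) affiliation. Missing affiliations never match."""
    aff_1 = record_1.get("affiliation")
    aff_2 = record_2.get("affiliation")
    if not aff_1 or not aff_2:
        return False
    return (record_1["principal"] == record_2["principal"]
            and normalize_affiliation(aff_1) == normalize_affiliation(aff_2))


# Made-up records for illustration only.
rec_a = {"principal": "M. Imran", "affiliation": "Example University, Dept. of CS"}
rec_b = {"principal": "M. Imran", "affiliation": "example university  dept of cs"}
rec_c = {"principal": "M. Imran", "affiliation": "Another Institute"}

merged = same_author_by_affiliation(rec_a, rec_b)
kept_apart = same_author_by_affiliation(rec_a, rec_c)
```

Here `merged` is True and `kept_apart` is False. Exact matching after normalization is the simplest choice; a fuzzy string comparison would be an obvious alternative.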

Features: Authors' Names
• An author's name can have multiple name variations.
• For example: Muhammad Imran
  - M. Imran
  - Imran Muhammad
  - Muhammad I.
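One possible way to test whether two strings are variations of the same name is sketched below. It ignores token order and lets a single-letter initial match any token starting with that letter. The greedy matching and the function names `name_tokens` and `is_name_variation` are assumptions, not the method used in the talk.

```python
def name_tokens(name):
    """Split a name into lowercase tokens without commas or periods."""
    parts = name.replace(",", " ").split()
    return [p.strip(".").lower() for p in parts if p.strip(".")]


def tokens_match(a, b):
    """Tokens match if equal, or if one is an initial of the other."""
    if a == b:
        return True
    if len(a) == 1:
        return b.startswith(a)
    if len(b) == 1:
        return a.startswith(b)
    return False


def is_name_variation(name_1, name_2):
    """Greedy, order-insensitive comparison of two names (a sketch)."""
    tokens_1, remaining = name_tokens(name_1), name_tokens(name_2)
    if len(tokens_1) != len(remaining):
        return False
    for token in tokens_1:
        match = next((t for t in remaining if tokens_match(token, t)), None)
        if match is None:
            return False
        remaining.remove(match)
    return True


full = "Muhammad Imran"
results = [is_name_variation(full, v)
           for v in ("M. Imran", "Imran Muhammad", "Muhammad. I")]
different = is_name_variation(full, "Malik Imran")
ambiguous = is_name_variation("M. Imran", "Malik Imran")
```

All three variations from the slide match the full name, and "Malik Imran" does not. Note that "M. Imran" also matches "Malik Imran", which is exactly the mixed-citation ambiguity: name matching alone cannot separate the authors, so the other features are needed.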

Features: Publication Titles
• Title as a string literal
• We maintain a vector of important keywords
  - Represents the author's interests
• A similarity measure between a given citation record and the vector can be useful
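A minimal sketch of a title-based keyword vector and a similarity measure. The slides do not name the measure, so cosine similarity over keyword counts is an assumption here, as are the small stopword list, the function names and the sample titles.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption).
STOPWORDS = {"a", "an", "the", "of", "for", "in", "on", "and", "to", "with"}


def title_keywords(title):
    """Lowercase alphabetic words of a title, without stopwords."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOPWORDS]


def keyword_vector(titles):
    """Count keywords over a collection of titles."""
    vector = Counter()
    for title in titles:
        vector.update(title_keywords(title))
    return vector


def cosine_similarity(v1, v2):
    """Cosine similarity between two keyword-count vectors."""
    dot = sum(v1[k] * v2[k] for k in v1.keys() & v2.keys())
    norm_1 = math.sqrt(sum(c * c for c in v1.values()))
    norm_2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm_1 * norm_2) if norm_1 and norm_2 else 0.0


# Made-up titles: the first two stand for the selected author's records.
profile = keyword_vector([
    "Name disambiguation in digital libraries",
    "Clustering citation records for author disambiguation",
])
close = cosine_similarity(profile, keyword_vector(["Author name disambiguation"]))
far = cosine_similarity(profile, keyword_vector(["Protein folding simulation"]))
```

A candidate record whose title shares keywords with the profile scores higher (`close`) than an unrelated one (`far`, which is 0.0 here); a threshold on this score would be a tuning choice.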

Features: Principal Author's Homepage
• Homepage is the URL of an author's homepage.

Disambiguation System in Action
• Inter-related disciplines based formation of clusters
• Co-authors-based split
• Affiliation-based agglomerative merge
• Pursuit of the remaining bits

Inter-related disciplines based formation of clusters
• Exploits venue/discipline information
• Forms relatively big clusters
• Involves users and considers their selection among the clusters

Co-author Based Split
• Using k-means clustering

Experiment & Evaluation
Dataset
• 50 most ambiguous researchers
• Manually annotated a gold-standard dataset
• Used DBLP as the data source
• Used ADANA as the baseline approach
• Used Precision, Recall and F1 as performance measures
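The slides do not say how Precision, Recall and F1 are computed for clusters. One common convention, assumed here, is the pairwise variant: a pair of citation records counts as correct if both the predicted clustering and the gold annotation put the two records together. The function names and the toy labels are illustrative only.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """All index pairs (i, j), i < j, whose records share a cluster label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def pairwise_prf(predicted, gold):
    """Pairwise precision, recall and F1 of a predicted clustering."""
    pred_pairs = same_cluster_pairs(predicted)
    gold_pairs = same_cluster_pairs(gold)
    true_positives = len(pred_pairs & gold_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: four records, gold has two authors, prediction merges too much.
gold = ["a", "a", "b", "b"]
predicted = ["x", "x", "x", "y"]
precision, recall, f1 = pairwise_prf(predicted, gold)
```

In the toy example the prediction puts three records together, so only one of its three pairs is correct (precision 1/3), and it recovers one of the two gold pairs (recall 1/2), giving F1 = 0.4.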

Thank you!
Muhammad Imran
mimran@qf.org.qa


A Real-time Heuristic-based Unsupervised Method for Name Disambiguation in Digital Libraries
