The document presents research on developing an adaptive entity linking system called ADEL. It discusses six problems in entity linking and proposes two research questions: how to adapt a system to different kinds of text, entity types, knowledge bases, and languages, and how to process large amounts of data in near real time. It describes ADEL's modular framework, including extraction, overlap resolution, indexing, linking, and pruning modules. The evaluation reports ADEL's linking F-measure on the NEEL2014, NEEL2015, NEEL2016, OKE2015, and OKE2016 challenges. Future work focuses on knowledge base and language adaptivity, improving the system, and engineering a distributed architecture.
3. Use Case: Bringing Context to Documents
James Patrick Page, OBE (born 9 January 1944) is an English musician, songwriter, and record producer who achieved international success as the guitarist and founder of the rock band Led Zeppelin.
Sort name: Page, Jimmy
Type: Person
Gender: Male
Born: 1944-01-09 (72 years ago)
Born in: Heston, Hounslow, London, United Kingdom
Country of origin: United Kingdom
Musical genre: Blues rock, psychedelic rock
Years active: 1962-1968 and since 1992
Labels: Columbia
The Yardbirds are a British rock band of the 1960s, formed in May 1963 in London, England, whose guitarists were Eric Clapton, Jeff Beck and then Jimmy Page.
2016/04/14 - PhD Symposium WWW 2016 - Montréal
4. Six Different Problems
1. Identity of an entity
   Arena; Arena (magazine); Arena (TV series)
   Bucks County, Pennsylvania; Milwaukee Bucks
2. Knowledge bases have different coverage
3. High computation to handle large streams
4. Various types for an entity (granularity)
   Yannick Noah is a Tennis Player and a Singer
5. Different types of documents written in multiple languages
6. Are all phrases entities? (e.g. dates or roles)
5. Research Questions
1. How to adapt an entity linking system depending on different criteria (kind of text, entity types, knowledge base, language)?
2. How to design an entity linking system in order to be able to process a large amount of data in near real time?
6. State Of The Art
• The key role of entities:
  - 70% of search queries contain at least one entity [1]
  - Bring context to videos [2]
  - Help generate summaries [3]
• Current systems (e.g. TagME [4], AIDA [5], Babelfy [6] or DBpedia Spotlight [7]) offer few parameters and often cannot be adapted to even one of the previous criteria
• Those solutions are often not able to handle large streams of text
[1] Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
[2] José Luis Redondo García, Giuseppe Rizzo, Raphaël Troncy: The Concentric Nature of News Semantic Snapshots: Knowledge Extraction for Semantic Annotation of News Items. K-CAP 2015
[3] Shruti Chhabra, Srikanta Bedathur: Towards Generating Text Summaries for Entity Chains. ECIR 2014
[4] Paolo Ferragina, Ugo Scaiella: TAGME: on-the-fly annotation of short text fragments (by Wikipedia entities). CIKM 2010
[5] Mohamed Amir Yosef, Johannes Hoffart, Ilaria Bordino, Marc Spaniol, Gerhard Weikum: AIDA: An Online Tool for Accurate Disambiguation of Named Entities in Text and Tables. PVLDB 4(12)
[6] Andrea Moro, Alessandro Raganato, Roberto Navigli: Entity Linking meets Word Sense Disambiguation: a Unified Approach. TACL 2014
[7] Pablo N. Mendes, Max Jakob, Andrés García-Silva, Christian Bizer: DBpedia spotlight: shedding light on the web of documents. I-SEMANTICS 2011
7. Methodology
We have split this thesis into six tasks. The numbers in parentheses refer to the research questions.
[Timeline figure: Start thesis, Today, End thesis]
(1) Text adaptivity
(1) Entity type adaptivity
(1) Knowledge base adaptivity
(1) Language adaptivity
(1-2) ADEL Modular framework
(2) Distributed and scalable architecture
8. ADEL: Modular Framework (Extractors)
• POS Tagger: bidirectional CMM (left to right and right to left)
• NER Combiner: uses a combination of CRF models with Gibbs sampling (Monte Carlo as graph inference method). A simple CRF model could be:

  Token:    Jimmy  Page  ,  connaissant  le  professionnalisme  de  John  Paul  Jones
  Label:    PER    PER   O  O            O   O                  O   PER   PER   PER
  Features: X      X     X  X            X   X                  X   X     X     X

  X: the set of features for the current word, e.g. the word is capitalized, the previous word is "de", the next word is an NNP.
  Suppose P(PER | X, PER, O, LOC) = P(PER | X, neighbors(PER)); then X with PER is a CRF.
9. ADEL: Modular Framework (Overlap Resolution)
• Detect overlaps among extractors using the boundaries of the entities
• Different heuristics can be applied:
  - Merge (default behavior): "United States" and "States of America" => "United States of America"
  - Simple Substring: "Florence" and "Florence May Harding" => "Florence" and "May Harding"
  - Smart Substring: "Giants of New York" and "New York" => "Giants" and "New York"
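The Merge and Simple Substring heuristics above can be sketched as follows. This is a minimal illustration, not ADEL's implementation: it assumes mentions are represented as (start, end) character offsets, and the function names `overlaps`, `merge` and `simple_substring` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # ASSUMPTION: (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """True when the two spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def merge(a: Span, b: Span) -> List[Span]:
    """Merge heuristic: replace two overlapping spans by their union."""
    return [(min(a[0], b[0]), max(a[1], b[1]))]


def simple_substring(a: Span, b: Span, text: str) -> List[Span]:
    """Simple Substring heuristic: keep the shorter span plus the rest of the longer one."""
    short, long_ = sorted((a, b), key=lambda s: s[1] - s[0])
    if not (long_[0] <= short[0] and short[1] <= long_[1]):
        return [a, b]  # not nested: leave both spans untouched
    result = [short]
    for start, end in ((long_[0], short[0]), (short[1], long_[1])):
        piece = text[start:end]
        stripped = piece.strip()
        if stripped:
            # Skip the whitespace that separated the leftover from the short span.
            offset = start + len(piece) - len(piece.lstrip())
            result.append((offset, offset + len(stripped)))
    return sorted(result)


text = "Florence May Harding"
spans = simple_substring((0, 8), (0, 20), text)
print([text[s:e] for s, e in spans])
```

Working on offsets rather than strings keeps the result tied to positions in the source text, which the later linking step needs.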
10. Modular Framework: Indexing
• Create index from DBpedia and Wikipedia
• Integrate external data such as PageRank and HITS scores from the Hasso Plattner Institute
11. ADEL: Modular Framework (Linking)
• Generate candidate links for all extracted mentions:
  - If any, they go to the linking method
  - If not, they are linked to NIL
• Linking method: ADEL linear formula:
  $$ r(l) = \left( a \cdot L(m, title) + b \cdot \max(L(m, R)) + c \cdot \max(L(m, D)) \right) \cdot PR(l) $$
  - r(l): the score of the candidate l
  - L: the Levenshtein distance
  - m: the extracted mention
  - title: the title of the candidate l
  - R: the set of redirect pages associated to the candidate l
  - D: the set of disambiguation pages associated to the candidate l
  - PR: PageRank associated to the candidate l
  - a, b and c are weights following the properties: a > b > c and a + b + c = 1
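The linear scoring described above could be sketched as below. Several points are assumptions rather than facts from the slides: L is implemented as a normalised Levenshtein similarity in [0, 1] so that a closer match gives a higher score, an empty set of redirects or disambiguation pages contributes 0, and the weights 0.5, 0.3 and 0.2 are illustrative values that only satisfy a > b > c and a + b + c = 1.

```python
from typing import Iterable


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """ASSUMPTION: normalised Levenshtein similarity (1.0 means identical)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


def best(mention: str, labels: Iterable[str]) -> float:
    """Maximum similarity over a set of labels; ASSUMPTION: 0.0 for an empty set."""
    return max((similarity(mention, label) for label in labels), default=0.0)


def score(mention, title, redirects, disambiguations, pagerank,
          a=0.5, b=0.3, c=0.2):
    """r(l) = (a*L(m, title) + b*max L(m, R) + c*max L(m, D)) * PR(l)."""
    assert a > b > c and abs(a + b + c - 1.0) < 1e-9
    return (a * similarity(mention, title)
            + b * best(mention, redirects)
            + c * best(mention, disambiguations)) * pagerank
```

A mention would then be linked to the candidate with the highest score, for example `max(candidates, key=lambda cand: score(mention, *cand))` for candidates stored as (title, redirects, disambiguations, pagerank) tuples.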
12. ADEL: Modular Framework (Pruning)
• k-NN machine learning algorithm
• Why a pruning module?
  - Useful to correct the errors from the extractor by removing wrong annotations. Example ("against" is a preposition in the first sentence and the name of a music act in the second):
      France played against Russia for a friendly match.
      Yesterday, I went to see Against in concert.
  - Useful to adapt the annotations in order to follow a given guideline. Example: suppose we are participating in two different challenges, NEEL2014, which counts dates as entities, and OKE2015, which does not (brackets mark the annotations that are kept).
      1st challenge: [Jimmy Page] was born on [January 9th, 1944].
      2nd challenge: [Jimmy Page] was born on January 9th, 1944.
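A k-NN pruning decision of the kind described above might look like the following sketch. The slides do not say which features ADEL uses, so the two features here (whether the mention is capitalized, and its linking score) and the toy training set are hypothetical and serve only to show the mechanics of the vote.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

# (feature vector, True if the annotation should be kept)
Example = Tuple[Sequence[float], bool]


def knn_keep(features: Sequence[float], training: List[Example], k: int = 3) -> bool:
    """Majority vote among the k nearest training examples (Euclidean distance)."""
    nearest = sorted(training, key=lambda ex: dist(features, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# HYPOTHETICAL features: [is_capitalized, linking_score]; toy data, not ADEL's.
training = [
    ([1.0, 0.9], True), ([1.0, 0.8], True), ([1.0, 0.7], True),
    ([0.0, 0.2], False), ([0.0, 0.1], False), ([0.0, 0.3], False),
]

annotations = {"Against": [1.0, 0.85], "against": [0.0, 0.15]}
kept = [m for m, f in annotations.items() if knn_keep(f, training)]
print(kept)
```

Following a different guideline would amount to retraining on examples labelled according to that guideline, for instance with date mentions labelled as kept for one challenge and as removed for the other.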
13. Text Adaptivity
• Experiments on different kinds of text by benchmarking ADEL over different challenges
  - Tweets: NEEL2014, NEEL2015 and NEEL2016
  - News articles: OKE2015 and OKE2016
• Need to adapt the extractors to use a proper model to handle different kinds of text
  - Retrain the NER extractor with a training dataset
14. Type Adaptivity
• Challenges have their own definition of types
• In ADEL, types come from the NER extractor and from the knowledge base in use
  - NER types are different from KB types
  - NER types and KB types are different from challenge types
• Need a mapping between those different types. It is currently made manually.

  Challenge               Types
  OKE2015 and OKE2016     Person, Place, Organization, Role
  NEEL2015 and NEEL2016   Person, Location, Organization, Product, Event, Thing
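A manual mapping of this kind can be represented as a plain lookup table. The sketch below is an assumption about its shape, not ADEL's actual mapping: the NER labels (PER, LOC, ORG) and the choice of DBpedia ontology classes are illustrative, and `map_type` is a hypothetical helper.

```python
from typing import Dict, Optional

# HYPOTHETICAL hand-written mappings from NER and KB types to challenge types.
TO_OKE: Dict[str, str] = {
    "PER": "Person", "LOC": "Place", "ORG": "Organization",
    "dbo:Person": "Person", "dbo:Place": "Place", "dbo:Organisation": "Organization",
}
TO_NEEL: Dict[str, str] = {
    "PER": "Person", "LOC": "Location", "ORG": "Organization",
    "dbo:Person": "Person", "dbo:Place": "Location", "dbo:Organisation": "Organization",
}
MAPPINGS: Dict[str, Dict[str, str]] = {
    "OKE2015": TO_OKE, "OKE2016": TO_OKE,
    "NEEL2015": TO_NEEL, "NEEL2016": TO_NEEL,
}


def map_type(source_type: str, challenge: str) -> Optional[str]:
    """Return the challenge type for a NER or KB type, or None when unmapped."""
    return MAPPINGS.get(challenge, {}).get(source_type)


print(map_type("LOC", "OKE2015"), map_type("dbo:Place", "NEEL2016"))
```

Returning None for an unmapped type leaves it to the caller to decide whether such an annotation is dropped or given a fallback type.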
15. Knowledge Base Adaptivity
• Joint work with Vrije Universiteit Amsterdam
• ReCon: define several heuristics in order to re-rank candidate links provided by our system on newswire articles
  - H1: process the article text first and disambiguate the article title at the end, because titles are often too ambiguous
  - H2: detect co-referential entities throughout the article
  - H3: topic modeling to exploit a contextual knowledge base about the found topic
16. Language Adaptivity
• No results yet. The goal is to let the user choose the natural language used in the text
• Test the framework on ETAPE, which is a NER challenge on French TV content from 2012
17. Distributed and Scalable Architecture
• No results yet. The goal is to deploy the framework so that the tasks run in a distributed and scalable way
• Make each task (extraction, linking and pruning) independent of the others and move them out of the global architecture (taking the way Docker is developed as a model)
• Stress test the new architecture over large streams, such as the Twitter streaming API, to detect possible bottlenecks
18. Evaluation Over Multiple Datasets in Linking
• 2014 NEEL Challenge with ADEL v1 using the neleval scorer

  System     E2E    UTwente  DataTXT  ADEL   AIDA   Hyderabad  SAP
  F-measure  70.06  54.93    49.9     46.29  45.37  45.23      39.02

• 2015 NEEL Challenge with ADEL v1 using the neleval scorer

  System     ousia  acubelab  ADEL  uniba  ualberta  uva   cen_neel
  F-measure  76.2   52.3      47.9  46.4   41.5      31.6  0

• 2016 NEEL Challenge with ADEL v2 using the neleval scorer

  System     ADEL   kea    Insight  mit    ju     unimib
  F-measure  61.98  54.86  38.28    36.09  35.48  33.53

• OKE2015 Challenge with ADEL v1 using the GERBIL scorer

  System     ADEL   FOX    FRED
  F-measure  60.75  49.88  34.73

• OKE2016 Challenge with ADEL v2 using the neleval scorer

  System     ADEL
  F-measure  56.5
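For readers unfamiliar with the metric, the F-measure reported above is the harmonic mean of precision and recall. The sketch below assumes exact matching of (start, end, link) triples; the neleval and GERBIL scorers apply their own matching rules, so this is only an illustration of the formula, and `f_measure` is a hypothetical helper.

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # ASSUMPTION: (start, end, link), matched exactly


def f_measure(gold: Set[Annotation], predicted: Set[Annotation]) -> float:
    """Harmonic mean of precision and recall over exactly matching annotations."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = {(0, 10, "dbr:Jimmy_Page"), (40, 55, "dbr:John_Paul_Jones_(musician)")}
predicted = {(0, 10, "dbr:Jimmy_Page"), (40, 55, "NIL")}
print(f_measure(gold, predicted))
```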
19. Conclusions
• Combining multiple techniques coming from different domains for entity recognition and linking
• Having developed different methods in order to make an entity linking system adaptive to one or multiple criteria
• Bringing a new approach with ADEL while also reusing existing approaches with the POS and NER extractors
• Testing ADEL over different datasets and participating in challenges
20. Future Work
• Knowledge base adaptivity
  - Further evaluate the knowledge base and text adaptive features using the ERD dataset
  - Evaluate the knowledge base adaptive feature using the TAC KBP dataset
  - Experiment with the knowledge base adaptive feature using the 3cixty and ad hoc tourism datasets
• Language adaptivity
  - Evaluate the language adaptive feature using the ETAPE and TAC KBP datasets
• Modular Framework
  - Improve the linking and the pruning with new methods (e.g. evaluate deep learning methods)
• Type adaptivity
  - Further evaluate the approach over more fine-grained types using the ETAPE challenge. This will raise more issues, especially with the scorers
• Engineer and evaluate a distributed and scalable architecture on large data streams