http://www.lattice.cnrs.fr | Demonstrations at NAACL HLT 2015, Conference of the North American Chapter of the Association for Computational Linguistics – Human Language Technologies, Denver, Colorado (US), May 31-June 5
Expression extractions should be improved and implemented on open source software. The careful use of natural language processing
algorithms could provide better filtering metrics and support in expression merging.
The manual filtering is crucial because it allows entities to be reduced to a set size appropriate for analysis, but also recovering
important entities that could have been excluded by the automatic filtering.
Expressed in [1] by social scientists from médialab (Paris Institute of Political Studies, SciencesPo)
LATTICE Lab
CNRS – Ecole Normale Supérieure
U Paris 3 Sorbonne Nouvelle
ELCO3: Entity Linking with Corpus Coherence Combining Open Source Annotators
Pablo Ruiz, Thierry Poibeau and Frédérique Mélanie
pablo.ruiz.fabo@ens.fr
Our users’ needs in Entity Linking (EL)
The Problem
o Target users: social science researchers
o Performance of EL systems varies widely depending on corpus
characteristics and types of entities required
o Difficult for users to choose optimal EL system for their corpora
o Our target users wish to filter EL results, making informed
choices about entities to keep and discard
Our Approach
o Public open source tools
o Combine outputs of several tools to get complementary results
o Providing metrics for users to evaluate quality of an annotation
o Simultaneous access to metrics and text to validate annotations
o Besides manual selection, automatic selection also possible via
weighted voting of annotations
Demo features
TRAFFIC-LIGHT MATRIX FORMAT
o Annotation confidence scores provided by EL services
o Measures of coherence between an entity and the most
representative entities in the corpus
› Wikipedia Link-based Measure: Relatedness between two entities
as a function of Wikipedia pages linking to both and linking to one only
Milne-Witten [3] coherence between entities e1 and e2 (as in Hoffart et al. [4])
› Other possible measures
• Distance between entities’ categories in a Wikipedia
category graph
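The link-based relatedness described above can be sketched as follows. This is a minimal illustration of the Milne-Witten measure in the form commonly used after Hoffart et al. [4], not the demo's actual code. The function names, the use of plain Python sets of in-linking page ids, and the choice to average relatedness over the representative entities to obtain a corpus coherence score are all assumptions made for illustration.

```python
import math


def milne_witten_relatedness(inlinks_a, inlinks_b, total_pages):
    """Relatedness in [0, 1] from the Wikipedia pages linking to each entity.

    inlinks_a, inlinks_b: iterables of ids of pages linking to each entity.
    total_pages: number of pages in the Wikipedia version used.
    """
    a, b = set(inlinks_a), set(inlinks_b)
    shared = a & b
    if not shared:
        return 0.0  # no page links to both entities
    larger, smaller = max(len(a), len(b)), min(len(a), len(b))
    if total_pages < larger:
        raise ValueError("total_pages is smaller than an in-link set")
    denominator = math.log(total_pages) - math.log(smaller)
    if denominator == 0:
        return 1.0  # both in-link sets cover every page
    distance = (math.log(larger) - math.log(len(shared))) / denominator
    return max(0.0, 1.0 - distance)  # negative values are clipped to 0


def corpus_coherence(entity_inlinks, representative_inlinks, total_pages):
    """Hypothetical aggregate: mean relatedness to the representative entities."""
    if not representative_inlinks:
        return 0.0
    scores = [
        milne_witten_relatedness(entity_inlinks, other, total_pages)
        for other in representative_inlinks
    ]
    return sum(scores) / len(scores)
```

Two entities whose in-link sets overlap heavily score close to 1, and entities with no shared in-links score 0.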
Corpus: subset of PoliInformatics [2], about 2008 US financial crisis
(1) Query via Search Text displays:
• Document Panel: Documents matching the query
• Entity Panel: Entities extracted in the documents matching the
query displayed on doc. panel, plus:
(2) Confidence Scores for each annotator, normalized to a 0-1
range. (T=TagMe, S=Spotlight, W=Wikipedia Miner).
(3) Coherence score between the entity and a representative
subset of the corpus entities.
(4) Entities not coherent with the corpus are flagged in red.
(5) Query via Search Entities displays:
• Entity Panel: Entities matching the query.
• Document Panel: Documents containing one of the entities
displayed on the entity panel.
(6) Refine Search: Entities can be selected with a list of types
(like ORG) or selected individually with checkboxes.
(7) The Auto-Selection tab shows the output of an automatic
filtering via weighted voting of annotations.
(8) Charts: examples of co-occurrence networks, created offline
exploiting workflow information (sentence number, confidence, …)
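Items (2) to (4) above can be illustrated with a small sketch. The poster does not say how scores are normalized or where the color cut-offs lie, so the min-max rescaling and the thresholds `red_below` and `green_from` below are assumptions chosen for illustration only.

```python
def min_max_normalize(scores):
    """Rescale a dict of raw annotator scores to the 0-1 range."""
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        # No spread to rescale; returning 0.0 is an arbitrary choice.
        return {key: 0.0 for key in scores}
    return {key: (value - lowest) / (highest - lowest)
            for key, value in scores.items()}


def traffic_light(score, red_below=0.33, green_from=0.66):
    """Map a 0-1 score to a color; both thresholds are hypothetical."""
    if score < red_below:
        return "red"
    if score >= green_from:
        return "green"
    return "yellow"


# Example: raw scores for three entities, then their matrix colors.
raw = {"Lehman_Brothers": 4.0, "Subprime_lending": 3.0, "Denver": 2.0}
colors = {entity: traffic_light(score)
          for entity, score in min_max_normalize(raw).items()}
```

In this example the lowest-scoring entity is the one that ends up flagged in red.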
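[Figure: demo interface screenshot showing the Document Panel and the Entity Panel, with callouts 1-8 matching the numbered features and a color scale from 0.0 to 1.0]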
System workflows
o User always has access to full results, but the workflow can
select a subset of the annotations automatically.
o Workflow combines, via weighted voting, outputs of:
TagMe2, DBpedia Spotlight, Wikipedia Miner, AIDA, Babelfy
o Votes are weighted according to each annotator’s precision on
two reference corpora (IITB and AIDA/CONLL B), depending on
whether user requires annotations for common-noun entity
mentions or not.
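The weighted voting step above can be sketched as follows. This is a simplified illustration, not the scheme published in [5]: the function name, the `(annotator, span, entity)` tuple layout, the example weights and the acceptance `threshold` are all hypothetical. In the described system the weights would come from each annotator's precision on a reference corpus.

```python
from collections import defaultdict


def weighted_vote(annotations, weights, threshold=0.5):
    """Pick one entity per mention span by summing annotator weights.

    annotations: iterable of (annotator, span, entity) tuples.
    weights: dict mapping annotator name to its vote weight.
    threshold: minimum share of the total weight needed to keep an entity.
    """
    tallies = defaultdict(lambda: defaultdict(float))
    for annotator, span, entity in annotations:
        tallies[span][entity] += weights[annotator]

    total_weight = sum(weights.values())
    selected = {}
    for span, votes in tallies.items():
        entity, score = max(votes.items(), key=lambda item: item[1])
        if score / total_weight >= threshold:
            selected[span] = entity
    return selected


# Hypothetical weights standing in for precision on a reference corpus.
weights = {"tagme": 0.6, "spotlight": 0.3, "wikipedia_miner": 0.5}
annotations = [
    ("tagme", (0, 15), "Lehman_Brothers"),
    ("wikipedia_miner", (0, 15), "Lehman_Brothers"),
    ("spotlight", (0, 15), "Lehman_(surname)"),
    ("spotlight", (20, 24), "Bank"),
]
result = weighted_vote(annotations, weights)
```

Here the first span keeps the entity backed by two annotators, while the second span is dropped because a single low-weight vote falls below the threshold.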
[Workflow diagram legend: on demo / not shown on demo]
Evaluation
o Automatic EL system combination improved results over each
individual system’s results ([5], our *SEM poster).
o Assessed with strong annotation match and entity match [6] on
four different corpora: AIDA/CONLL B, IITB, MSNBC, AQUAINT.
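The two matching criteria can be illustrated with a short sketch, assuming the usual reading of [6]: a strong annotation match requires identical mention boundaries and the same entity, while an entity match compares only the sets of entities, regardless of position. The function names and the `(start, end, entity)` tuple layout are assumptions for illustration and are not taken from the benchmarking framework itself.

```python
def precision_recall_f1(system_items, gold_items):
    """Precision, recall and F1 between two collections of hashable items."""
    system_set, gold_set = set(system_items), set(gold_items)
    true_positives = len(system_set & gold_set)
    precision = true_positives / len(system_set) if system_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def strong_annotation_match(system, gold):
    """Score (start, end, entity) triples: span and entity must both agree."""
    return precision_recall_f1(system, gold)


def entity_match(system, gold):
    """Score entity sets only, ignoring where each entity was mentioned."""
    return precision_recall_f1(
        {entity for _, _, entity in system},
        {entity for _, _, entity in gold},
    )


gold = [(0, 7, "A"), (10, 15, "B")]
system = [(0, 7, "A"), (11, 15, "B"), (20, 25, "C")]
strong_scores = strong_annotation_match(system, gold)
entity_scores = entity_match(system, gold)
```

In this example the system annotation for "B" has a shifted start offset, so it counts under entity match but not under strong annotation match.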
[1] T. Venturini & D. Guido. 2012. Once upon a text. An ANT [Actor-Network Theory] Tale in Text
Analytics. Sociologica, 3:1-17. Il Mulino, Bologna.
[2] N. Smith et al. 2014. Overview of the 2014 NLP Unshared Task in PoliInformatics. In Proc. ACL
LACSS Workshop.
[3] D. Milne & I. Witten. 2008. An effective, low-cost measure of semantic relatedness obtained from
Wikipedia links. In Proc. AAAI WS on Wikipedia and AI.
[4] J. Hoffart et al. 2011. Robust disambiguation of named entities in text. In Proc. EMNLP.
[5] P. Ruiz & T. Poibeau. 2015. Combining open source annotators for entity linking through
weighted voting. In Proc. *SEM.
[6] M. Cornolti, P. Ferragina & M. Ciaramita. 2013. A framework for benchmarking entity-annotation
systems. In Proc. WWW, 249-260.
Metrics to assist in manual filtering
Annotation voting for automatic filtering
DEMO LINK: http://129.199.228.10/nav/gui/
