This document provides an overview of research on search and analytics using semantic annotations. It discusses using named entities, geographic references, and temporal expressions annotated in documents as common semantic annotations. Various models are used to make sense of these annotations for search, like identifying interesting time intervals for queries and diversifying search results by time. Analytics applications include linking Wikipedia to news archives using semantic divergence and generating event digests by selecting relevant sentences from documents.
Search & Analytics in Archives using Semantic Annotations
1. Search & Analytics in Archives using Semantic Annotations
Klaus Berberich (kberberi@mpi-inf.mpg.de)
2. Search & Analytics using Semantic Annotations (Klaus Berberich)
Team / About
• Research area Text + Time Search & Analytics at MPI-INF
  + close collaboration with Jannik Strötgen and others
• Focus: Efficient and effective methods to search and analyze text collections with temporal information (e.g., temporal expressions, timestamps)
Team: Kai Hui, Arunav Mishra, Dhruv Gupta, me
3. Motivation
• Idea of this talk is to give you an overview of our research during the past two-and-a-half years without diving too deep into technical details
• Semantic annotations as common ground of all our methods
  • What are they? How do we obtain and process them?
  • Which models do we use to make sense of them?
• Each of our recent papers presented on at most two slides (no technical details, but feel free to ask technical questions)
7. Semantic Annotations
• Our recent methods leverage semantic annotations to make documents more than sequences of text tokens
  • (disambiguated) mentions of named entities (e.g., persons)
  • geographic references (e.g., cities or countries)
  • temporal expressions (e.g., to past dates)
• We rely on existing off-the-shelf tools and some handcrafted rules to obtain semantic annotations for documents
8. (Disambiguated) Named Entities
• AIDA (https://www.ambiverse.com) as our named entity recognition and disambiguation (NERD) tool of choice
Example text:
In Indianola, Mississippi, a new museum will begin construction in June to honor a local boy who has become international star, the blues performer B.B. King. The B.B. King Museum and Delta Interpretive Center will begin with restoration of the cotton mill, and is expected to be open by 2007, with additions following. Artifacts from Mr. King's 60-year career will be housed in the museum and the interpretive center will focus on educational, cultural and character development programs for Mississippi youth.
9. (Disambiguated) Named Entities
• Named entities are anchored in the YAGO knowledge graph, which provides us with additional information about them
  • semantic types (from WordNet and Wikipedia)
  • surface forms, keyphrases, and links
  • general facts
[Figure: knowledge-graph excerpt showing the types wordnet_guitarist and American_Blues_Guitarist, the surface forms Riley B. King and Bluesboy, and the relations isLocatedIn, hasLatitude 33.43°, hasLongitude -90.63°]
10. Geographic References
• Geographic references derived from named entity mentions by considering only entities having type yagoGeoEntity
• Minimum-bounding rectangle (MBR) indicating geographic extent of a location determined by making use of isLocatedIn relationship and geographic coordinates from YAGO
[Figure: location with relations isLocatedIn, hasLatitude 33.43°, hasLongitude -90.63°]
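The MBR construction can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `MBR` class and `mbr_from_points` helper are hypothetical names, the sample coordinates are the ones shown on the slides, and the sketch ignores locations that span the antimeridian.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class MBR:
    """Axis-aligned bounding rectangle in latitude/longitude degrees (hypothetical type)."""
    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float


def mbr_from_points(points: Iterable[Tuple[float, float]]) -> MBR:
    """Smallest rectangle containing all (latitude, longitude) pairs.

    Assumption: the points are the coordinates of all entities that are
    (transitively) isLocatedIn the location of interest.
    """
    pts = list(points)
    if not pts:
        raise ValueError("need at least one coordinate")
    lats = [lat for lat, _ in pts]
    lons = [lon for _, lon in pts]
    return MBR(min(lats), min(lons), max(lats), max(lons))


# Coordinates as they appear on the slides; grouping them into one
# location is an assumption made for this example.
places = [(33.43, -90.63), (34.96, -89.98), (31.00, -90.64)]
box = mbr_from_points(places)
print(box)
```

A rectangle is cheap to store and to test for containment, which is presumably why it is used here instead of a richer shape.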
11. Geographic References
• Outlier removal found to be helpful (e.g., Hawaii or Alaska)
• Open question whether richer representation (e.g., convex hull) would be beneficial
[Figure: three locations with isLocatedIn relations and coordinates (33.43°, -90.63°), (34.96°, -89.98°), and (31.00°, -90.64°)]
12. Geographic References
• Probabilistic generative model for latitude-longitude pairs based on a set of MBRs (e.g., from document or query)
  • draw an MBR uniformly at random
  • draw a latitude-longitude pair contained in the MBR uniformly at random
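The two-step generative process above can be sketched as follows. This is an illustration under assumptions: MBRs are plain tuples, latitude and longitude are treated as a flat plane (uniform in degree space), and the function names are hypothetical.

```python
import random
from typing import Sequence, Tuple

# (min_lat, min_lon, max_lat, max_lon); a hypothetical representation
Box = Tuple[float, float, float, float]


def sample_point(mbrs: Sequence[Box], rng: random.Random) -> Tuple[float, float]:
    """Draw an MBR uniformly at random, then a point uniformly inside it."""
    min_lat, min_lon, max_lat, max_lon = rng.choice(list(mbrs))
    return rng.uniform(min_lat, max_lat), rng.uniform(min_lon, max_lon)


def density(point: Tuple[float, float], mbrs: Sequence[Box]) -> float:
    """Density of a point under the mixture of uniform distributions.

    Each MBR contributes 1/area if it contains the point; degenerate
    (zero-area) MBRs are skipped in this sketch.
    """
    lat, lon = point
    total = 0.0
    for min_lat, min_lon, max_lat, max_lon in mbrs:
        area = (max_lat - min_lat) * (max_lon - min_lon)
        if area > 0 and min_lat <= lat <= max_lat and min_lon <= lon <= max_lon:
            total += 1.0 / area
    return total / len(mbrs)


boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)]
print(density((1.5, 1.5), boxes))  # point lies inside both rectangles
print(sample_point(boxes, random.Random(0)))
```

Points covered by several MBRs receive more probability mass, so locations mentioned repeatedly in a document or query dominate its geographic model.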
13. Temporal Expressions
• Temporal expressions annotated using
  • SUTime (http://nlp.stanford.edu/software/sutime.shtml)
  • HeidelTime (https://github.com/HeidelTime/heideltime)
• Reliable publication dates are crucial for correct resolution of relative temporal expressions (e.g., last year, next month)
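To see why the publication date matters, consider a toy resolver for a handful of relative expressions. It is only a sketch: it does not reflect how SUTime or HeidelTime work internally, and the `resolve_relative` function and its tiny vocabulary are hypothetical.

```python
import calendar
from datetime import date
from typing import Tuple


def month_interval(year: int, month: int) -> Tuple[date, date]:
    """First and last day of a calendar month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


def resolve_relative(expression: str, publication_date: date) -> Tuple[date, date]:
    """Resolve 'last|this|next year|month' against a publication date."""
    direction, unit = expression.lower().split()
    offset = {"last": -1, "this": 0, "next": 1}[direction]
    if unit == "year":
        year = publication_date.year + offset
        return date(year, 1, 1), date(year, 12, 31)
    if unit == "month":
        # Count months since year 0 so that offsets cross year boundaries.
        index = publication_date.year * 12 + (publication_date.month - 1) + offset
        return month_interval(index // 12, index % 12 + 1)
    raise ValueError(f"unsupported unit: {unit}")


# The same phrase resolves differently for different publication dates.
print(resolve_relative("next month", date(1999, 7, 20)))
print(resolve_relative("next month", date(1999, 12, 15)))
```

An article with a wrong publication date would shift every relative expression in it by the same error.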
14. Temporal Expressions
• Meaning of temporal expressions is often vague (e.g., in June)
  • capture earliest/latest begin/end time point of any precise time interval that the temporal expression may refer to
Example text:
In Indianola, Mississippi, a new museum will begin construction in June to honor a local boy who has become international star, the blues performer B.B. King. The B.B. King Museum and Delta Interpretive Center will begin with restoration of the cotton mill, and is expected to be open by 2007, with additions following. Artifacts from Mr. King's 60-year career will be housed in the museum and the interpretive center will focus on educational, cultural and character development programs for Mississippi youth.
June: [2005/06/01, 2005/06/30, 2005/06/01, 2005/06/30]
2007: [2007/01/01, 2007/12/31, 2007/01/01, 2007/12/31]
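The quadruples can be represented as a small data structure. The sketch below assumes the component order (earliest begin, latest begin, earliest end, latest end), which matches the two quadruples shown on the slide; the class and helper names are hypothetical.

```python
import calendar
from datetime import date
from typing import NamedTuple


class TemporalExpression(NamedTuple):
    """Bounds on the begin and end of any interval the expression may mean."""
    begin_lower: date
    begin_upper: date
    end_lower: date
    end_upper: date


def from_month(year: int, month: int) -> TemporalExpression:
    first = date(year, month, 1)
    last = date(year, month, calendar.monthrange(year, month)[1])
    return TemporalExpression(first, last, first, last)


def from_year(year: int) -> TemporalExpression:
    return TemporalExpression(date(year, 1, 1), date(year, 12, 31),
                              date(year, 1, 1), date(year, 12, 31))


def may_refer_to(expr: TemporalExpression, begin: date, end: date) -> bool:
    """True if [begin, end] is a valid interval within the expression's bounds."""
    return (begin <= end
            and expr.begin_lower <= begin <= expr.begin_upper
            and expr.end_lower <= end <= expr.end_upper)


june = from_month(2005, 6)
print(june)
print(may_refer_to(june, date(2005, 6, 10), date(2005, 6, 20)))
```

Keeping four bounds instead of a single interval lets "in June" stand for any stretch of days within that month rather than forcing it to mean the whole month.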
15. Temporal Expressions
• Probabilistic generative model for time intervals based on a set of temporal expressions (e.g., from document or query)
  • draw a temporal expression (quadruple) uniformly at random
  • draw a time interval that the temporal expression may refer to
[Figure: begin time point plotted against end time point, with axis ticks 05, 06, 07]
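Under this generative model, the probability of a specific time interval can be computed in closed form. The sketch below makes two assumptions that the slide does not state: time is discretized at day granularity, and the second draw is uniform over all intervals allowed by the quadruple. Function names are hypothetical.

```python
from datetime import date
from typing import Sequence, Tuple

# (earliest begin, latest begin, earliest end, latest end)
Quadruple = Tuple[date, date, date, date]


def interval_count(quad: Quadruple) -> int:
    """Number of day-granularity intervals [b, e] with b <= e within the bounds."""
    bl, bu, el, eu = (d.toordinal() for d in quad)
    return sum(max(0, eu - max(el, b) + 1) for b in range(bl, bu + 1))


def interval_probability(begin: date, end: date, quads: Sequence[Quadruple]) -> float:
    """P([begin, end]) when a quadruple and then an interval are drawn uniformly."""
    if not quads or begin > end:
        return 0.0
    total = 0.0
    for quad in quads:
        bl, bu, el, eu = quad
        if bl <= begin <= bu and el <= end <= eu:
            total += 1.0 / interval_count(quad)
    return total / len(quads)


june = (date(2005, 6, 1), date(2005, 6, 30), date(2005, 6, 1), date(2005, 6, 30))
year_2007 = (date(2007, 1, 1), date(2007, 12, 31), date(2007, 1, 1), date(2007, 12, 31))
print(interval_count(june))
print(interval_probability(date(2005, 6, 10), date(2005, 6, 20), [june, year_2007]))
```

Vague expressions spread their probability over many candidate intervals, while precise ones (a single day) concentrate it on one.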
19. Identifying Interesting Time Intervals
• Idea: Given a keyword query (e.g., world war), determine interesting time intervals (e.g., [1914, 1918] or [1939, 1945]) that help the user refine the query to explore results
D. Gupta and K. Berberich: Identifying Time Intervals of Interest to Queries, CIKM 2014
$$P([t_b, t_e] \mid q) = \sum_{d \in \mathrm{top}(q,k)} P([t_b, t_e] \mid d) \cdot P(d \mid q)$$
• $P([t_b, t_e] \mid d)$: generative model based on temporal expressions
• $P(d \mid q)$: document likelihood ~ query likelihood
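The sum over the top-k documents can be written down directly. In this sketch the per-document interval probabilities and the values of P(d | q) are hypothetical inputs (in practice they would come from the time-interval model and a retrieval model), and the function name is illustrative.

```python
from collections import defaultdict
from typing import Dict, Tuple

Interval = Tuple[int, int]


def interval_scores(
    doc_interval_probs: Dict[str, Dict[Interval, float]],
    doc_query_probs: Dict[str, float],
) -> Dict[Interval, float]:
    """P([tb, te] | q) = sum over top-k documents of P([tb, te] | d) * P(d | q)."""
    scores: Dict[Interval, float] = defaultdict(float)
    for doc, p_dq in doc_query_probs.items():
        for interval, p_id in doc_interval_probs.get(doc, {}).items():
            scores[interval] += p_id * p_dq
    return dict(scores)


# Hypothetical top-2 result for the query "world war".
doc_interval_probs = {
    "d1": {(1914, 1918): 0.6, (1939, 1945): 0.4},
    "d2": {(1939, 1945): 1.0},
}
doc_query_probs = {"d1": 0.5, "d2": 0.5}  # assumed to be normalized over top-k

scores = interval_scores(doc_interval_probs, doc_query_probs)
for interval, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(interval, round(score, 3))
```

Intervals that many highly ranked documents refer to end up at the top of the list.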
22. Temporal Diversification of Search Results
• Idea: Re-rank documents to cover all interesting time intervals (e.g., [1914, 1918] or [1933, 1945]) previously identified for a keyword query (e.g., world war)
D. Gupta and K. Berberich: Diversifying Search Results Using Time, ECIR 2016
$$\arg\max_{R \,:\, |R| = k} \; \sum_{[t_b, t_e]} P([t_b, t_e] \mid q) \left( 1 - \prod_{d \in R} \bigl( 1 - P([t_b, t_e] \mid d) \cdot P(d \mid q) \bigr) \right)$$
• $P([t_b, t_e] \mid q)$: probability that time interval $[t_b, t_e]$ is interesting for query $q$
• $1 - \prod_{d \in R} (\ldots)$: probability that user sees at least one document in $R$ that covers $[t_b, t_e]$
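The slide states the objective but not how it is optimized. One common heuristic for coverage objectives of this shape is greedy selection, sketched below; whether the paper uses this procedure is an assumption, and all names and numbers are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[int, int]


def coverage(
    selected: Sequence[str],
    interval_probs: Dict[Interval, float],
    doc_interval_probs: Dict[str, Dict[Interval, float]],
    doc_query_probs: Dict[str, float],
) -> float:
    """Value of the diversification objective for a result set."""
    total = 0.0
    for interval, p_iq in interval_probs.items():
        miss = 1.0  # probability that no selected document covers the interval
        for doc in selected:
            miss *= 1.0 - doc_interval_probs[doc].get(interval, 0.0) * doc_query_probs[doc]
        total += p_iq * (1.0 - miss)
    return total


def greedy_diversify(candidates, k, interval_probs, doc_interval_probs, doc_query_probs) -> List[str]:
    """Greedy heuristic (an assumption): repeatedly add the best next document."""
    selected: List[str] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda d: coverage(selected + [d], interval_probs,
                                   doc_interval_probs, doc_query_probs),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


interval_probs = {(1914, 1918): 0.5, (1939, 1945): 0.5}
doc_interval_probs = {
    "d1": {(1914, 1918): 1.0},
    "d2": {(1914, 1918): 1.0},
    "d3": {(1939, 1945): 1.0},
}
doc_query_probs = {"d1": 0.5, "d2": 0.4, "d3": 0.3}
ranking = greedy_diversify(["d1", "d2", "d3"], 2, interval_probs,
                           doc_interval_probs, doc_query_probs)
print(ranking)
```

In this toy input a relevance-only ranking would return d1 and d2, both about the same interval, whereas the coverage objective prefers d3 as the second result.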
23. Linking Wikipedia and News Archives
• Idea: Given an excerpt from Wikipedia, automatically retrieve articles from a news archive that provide in-depth information
• Divergence-based re-ranking of top-K documents
A. Mishra and K. Berberich: Leveraging Semantic Annotations to Link Wikipedia and News Archives, ECIR 2016
Example excerpt:
The same month, a groundbreaking was held for a new museum, dedicated to King,[44] in Indianola, Mississippi.[45] The B.B. King Museum and Delta Interpretive Center opened on September 13, 2008.[46]
$$KL(Q \,\|\, D) = KL(Q_{txt} \,\|\, D_{txt}) + KL(Q_{time} \,\|\, D_{time}) + KL(Q_{geo} \,\|\, D_{geo}) + KL(Q_{entity} \,\|\, D_{entity})$$
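The additive divergence can be sketched for discrete distributions as follows. The flooring of zero document probabilities with a small epsilon is a crude stand-in for proper smoothing and is an assumption, as are the dictionary-based model representation and the function names.

```python
import math
from typing import Dict, Hashable

Distribution = Dict[Hashable, float]
DIMENSIONS = ("text", "time", "geo", "entity")


def kl_divergence(q: Distribution, d: Distribution, epsilon: float = 1e-9) -> float:
    """KL(Q || D) for discrete distributions.

    Assumption: events missing from D are floored at epsilon instead of
    being smoothed with a background model.
    """
    total = 0.0
    for event, p in q.items():
        if p > 0:
            total += p * math.log(p / max(d.get(event, 0.0), epsilon))
    return total


def combined_divergence(query_models: Dict[str, Distribution],
                        doc_models: Dict[str, Distribution]) -> float:
    """Sum of the per-dimension divergences, as in the formula above."""
    return sum(kl_divergence(query_models[dim], doc_models[dim]) for dim in DIMENSIONS)


# Hypothetical toy models for a Wikipedia excerpt and one news article.
query_models = {
    "text": {"museum": 0.5, "king": 0.5},
    "time": {2008: 1.0},
    "geo": {"Indianola": 1.0},
    "entity": {"B.B._King": 1.0},
}
doc_models = {
    "text": {"museum": 0.25, "king": 0.75},
    "time": {2008: 1.0},
    "geo": {"Indianola": 1.0},
    "entity": {"B.B._King": 1.0},
}
print(combined_divergence(query_models, doc_models))
```

Documents are then re-ranked by ascending divergence, so a news article must match the excerpt in text, time, geography, and entities to rank highly.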
25. Generating Event Digests
• Idea: Given an event query (e.g., from a Wikipedia year page), automatically generate a summary of the event having a user-specified length
• Integer linear program (ILP) selects sentences from top-k pseudo-relevant documents
  • minimize divergence(s) between sentences and event query
  • cover all temporal expressions, geographic references, and named entities from event query
  • adhere to length budget
A. Mishra and K. Berberich: Event Digest: A Holistic View on Past Events, SIGIR 2016
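The selection problem can be illustrated with a tiny exhaustive search that stands in for the ILP. This is not the paper's formulation: the objective is simplified to the sum of per-sentence divergences, coverage is treated as a hard constraint, and the `Sentence` type, the scores, and the annotation labels are all hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, NamedTuple, Optional, Sequence, Tuple


class Sentence(NamedTuple):
    text: str
    length: int                   # e.g., number of words
    divergence: float             # divergence from the event query (lower is better)
    annotations: FrozenSet[str]   # times, places, and entities the sentence mentions


def select_sentences(sentences: Sequence[Sentence], required: FrozenSet[str],
                     budget: int) -> Optional[Tuple[Sentence, ...]]:
    """Brute-force stand-in for the ILP; only feasible for a handful of sentences."""
    best: Optional[Tuple[Sentence, ...]] = None
    best_cost = float("inf")
    for size in range(1, len(sentences) + 1):
        for subset in combinations(sentences, size):
            if sum(s.length for s in subset) > budget:
                continue  # violates the length budget
            covered = frozenset().union(*(s.annotations for s in subset))
            if not required <= covered:
                continue  # misses an annotation from the event query
            cost = sum(s.divergence for s in subset)
            if cost < best_cost:
                best, best_cost = subset, cost
    return best


required = frozenset({"East_Timor", "Indonesia", "1999-08-30"})
s1 = Sentence("s1", 10, 0.2, frozenset({"East_Timor", "Indonesia"}))
s2 = Sentence("s2", 12, 0.3, frozenset({"East_Timor", "1999-08-30"}))
s3 = Sentence("s3", 30, 0.1, frozenset({"East_Timor", "Indonesia", "1999-08-30"}))

digest = select_sentences([s1, s2, s3], required, budget=25)
print([s.text for s in digest])
```

With a budget of 25 the long sentence s3 does not fit, so two shorter sentences are combined to cover the query; with a larger budget s3 alone would be chosen. A real ILP solver is needed once the candidate pool grows beyond a few sentences.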
Example event: 1999/08/30 East Timor votes for independence from Indonesia in referendum
Figure 1: Example of an event digest.
Event Query
Description: East Timor votes for independence from Indonesia in a referendum
Time: August 30, 1999
Geolocations: 1643084; 1623843; 7289708;
Entities: East_Timor; Indonesia; 1999_East_Timorese_crisis;
Event Digest (with chronological ordering on publication dates)
Publication Date: July 20, 1999; Source Link: http://goo.gl/rJYDiZ
(1) Indonesia is preparing to relinquish control of East Timor after 23 years of occupation and it believes that independence advocates are highly likely to win a referendum next month says an authentic internal government report that has been made available to reporters by advocates of independence. (2) Late next month estimated 400 000 East Timorese are to choose between broad autonomy within Indonesia option 1 or independence option 2.
Publication Date: August 29, 1999; Source Link: http://goo.gl/Cz6Jkk
(3) Former president Jimmy Carter whose human rights and diplomacy organization the Carter Center is monitoring the referendum here said this month some top representatives of the government of Indonesia have failed to fulfill their main obligations with regard to public order and security.
Publication Date: November 21, 1999; Source Link: http://goo.gl/hdqYm8
(4) The last time it was East Timor which voted for independence from Indonesia in August only to be plunged into a spasm of violence that required an Australian led international military force to quell it. (5) Aceh's latest push for independence began with the fall of President Suharto in May 1998 and accelerated after the East Timor referendum.
Publication Date: September 24, 2000; Source Link: http://goo.gl/AijWVY
(6) East Timor has been under a transitional United Nations administration since the Aug. 30 independence vote last year. (7) The groups pillaged East Timor after last year's independence vote which freed the territory from military control.
Publication Date: August 24, 2001; Source Link: http://goo.gl/EAGBxC
(8) This vote like the referendum in 1999 is being organized by the United Nations which has continued to administer East Timor a former Portuguese colony annexed by Indonesia as it struggles to its feet economically and politically.
[...] An event digest in such a case becomes an intermediate level of linking that presents a holistic view. Excerpts from Wikipedia, an abstract view, are connected to excerpts in the digest, which are in turn connected to news articles that give a detailed view, as illustrated in Figure 2. [...]
26. Mining Events
• Idea: Identify and describe important events for a given query
• Clustering of sentences from top-k pseudo-relevant documents based on a Dirichlet process mixture model (DPM) with resulting clusters as events described by probability distributions over keywords, time intervals, geographic coordinates, and named entities
D. Gupta, J. Strötgen, and K. Berberich: EventMiner: Mining Events from Annotated Documents, ICTIR 2016
Query: lord of the rings movie

Entities: [YAGO:United_States] [YAGO:Al_Gore]
Table 10: An event cluster for query "us presidential elections". The cluster points to the US Presidential Elections in 2000 for which Al Gore ran as vice president.

Keywords: [soviet] [afghanistan] [war] [military] [beginning] [party] [forces] [union] [exhibition] [mixed]
Time: [01-Jan-1938, 01-Jan-1938] [01-Jan-1980, 01-Jan-1980] [29-Apr-1988, 29-Apr-1988] [01-Jan-1979, 01-Jan-1979] [01-Apr-1988, 01-Apr-1988] [29-Jul-1987, 29-Jul-1987] [01-Jan-1950, 01-Jan-1950]
Locations: [YAGO:Soviet_Union] [YAGO:Afghanistan] [YAGO:Moscow] [YAGO:Kabul] [YAGO:United_States]
Entities: [YAGO:Soviet_Union] [YAGO:Afghanistan] [YAGO:Mohammad_Najibullah] [YAGO:Moscow] [YAGO:Bosniaks] [YAGO:Kabul] [YAGO:United_States]
Table 11: An event cluster for query "soviet afghanistan war". It depicts the Soviet-Afghanistan conflict that lasted from 1979 to 1989.

Keywords: [lord] [rings] [top] [movie] [motion] [opinion] [pictures] [article] [elvis] [jackson] [trilogy] [movies]
Time: [15-Dec-2002, 15-Dec-2002] [01-Jan-1987, 01-Jan-1987] [25-Jan-2004, 25-Jan-2004] [12-Nov-2002, 12-Nov-2002] [01-Jan-2003, 31-Dec-2003] [01-Jan-1982, 01-Jan-1982] [11-Jan-2004, 11-Jan-2004] [28-Dec-2002, 29-Dec-2002] [07-Sep-2003, 07-Sep-2003] [01-Dec-2003, 31-Dec-2003]
Locations: [YAGO:Weldon,_Northamptonshire] [YAGO:Wellington]
Entities: [YAGO:J._R._R._Tolkien] [YAGO:Weldon,_Northamptonshire] [YAGO:Wellington] [YAGO:Carol_Ann_Lee] [YAGO:Peter_Jackson]
Table 12: An event cluster for query "lord of the rings movie". It captures the location where the movie was shot (Wellington) and the author of the book on which the movie is based (J. R. R. Tolkien).

Keywords: [iraq] [states] [united] [war] [opinion] [top] [international] [relations] [defense] [armament] [president] [time] [fearful] [david]
Time: [13-Apr-2006, 13-Apr-2006] [15-Jun-2005, 15-Jun-2005] [16-Jul-2003, 16-Jul-2003] [16-Oct-2003, 16-Oct-2003] [30-Jun-2005, 30-Jun-2005]
Locations: [YAGO:New_York_City] [YAGO:Port_Washington,_Wisconsin] [YAGO:Radcliff,_Kentucky] [YAGO:Iraq] [YAGO:United_States]
Entities: [YAGO:Iraq] [YAGO:United_States_Army] [YAGO:Donald_Rumsfeld] [YAGO:United_States_Department_of_Defense] [YAGO:George_W._Bush] [YAGO:Jim_Folsom] [YAGO:New_York_City] [YAGO:Port_Washington,_Wisconsin] [YAGO:Radcliff,_Kentucky] [YAGO:United_States]
Table 13: An event cluster for query "iraq war". The cluster shows the start of Iraq War in 2003.

Keywords: [iraq] [iran] [war] [oil] [international] [top] [faw] [port] [east] [world] [delegate] [rafsanjani]
Time: [01-Mar-1986, 31-Mar-1986] [01-Sep-1980, 01-Sep-1980] [01-Sep-1980, 30-Sep-1980] [01-Jan-1970, 31-Dec-1970] [01-Jan-1980, 01-Jan-1980] [23-Sep-2003, 23-Sep-2003] [25-Jan-1991, 25-Jan-1991] [01-Aug-1988, 31-Aug-1988] [17-Mar-2006, 17-Mar-2006] [01-Jan-1000, 01-Jan-1000] [01-Jan-1988, 31-Dec-1988] [02-Oct-2003, 02-Oct-2003]
Locations: [YAGO:Iran] [YAGO:Iraq] [YAGO:Geneva]
Entities: [YAGO:Iran] [YAGO:Iraq] [YAGO:United_Nations] [YAGO:Akbar_Hashemi_Rafsanjani] [YAGO:Iranian_peoples] [YAGO:Gulf_War] [YAGO:Geneva] [YAGO:United_Nations_Security_Council] [YAGO:Fao_Landing] [YAGO:Western_world] [YAGO:Persian_people] [YAGO:Iran-Iraq_War] [YAGO:National_Iraqi_News_Agency]
Table 14: An event cluster for query "iraq iran war". It describes the conflict between Iran and Iraq that lasted from 1980 to 1988.
27. Estimating Time Models
• Idea: Estimate a time model (i.e., probability distribution over time intervals) for excerpts (e.g., sentences or paragraphs) that do not contain temporal expressions
A. Mishra and K. Berberich: Estimating Time Models for News Article Excerpts, CIKM 2016
Example excerpts:
"B.B. King collaborated with Eric Clapton on the album Riding with the King in 2000"
"In 2007 B.B. King played at Eric Clapton's second Crossroads Festival"
"B.B. King played with his long-time collaborator Eric Clapton"
28. Estimating Time Models
• Distribution propagation approach with edge weights based on text similarity, conceptual similarity, and contextual similarity
A. Mishra and K. Berberich: Estimating Time Models for News Article Excerpts, CIKM 2016
Example excerpts:
"B.B. King collaborated with Eric Clapton on the album Riding with the King in 2000"
"In 2007 B.B. King played at Eric Clapton's second Crossroads Festival"
"B.B. King played with his long-time collaborator Eric Clapton"
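The slide only names the approach, so the following sketch uses a generic label-propagation update as a stand-in: excerpts with temporal expressions keep their time model, and every other excerpt repeatedly takes the weighted average of its neighbors' models. The update rule, the edge weights, and the year-level time models are assumptions for illustration.

```python
from typing import Dict, Sequence, Tuple

TimeModel = Dict[int, float]  # year -> probability (a simplification)


def propagate(nodes: Sequence[str],
              weights: Dict[Tuple[str, str], float],
              seeds: Dict[str, TimeModel],
              iterations: int = 20) -> Dict[str, TimeModel]:
    """Propagate seed time models over an undirected weighted graph."""
    neighbors = {n: [] for n in nodes}
    for (u, v), w in weights.items():
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    models = {n: dict(seeds.get(n, {})) for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            if n in seeds:
                updated[n] = models[n]  # excerpts with temporal expressions stay fixed
                continue
            acc: TimeModel = {}
            total = 0.0
            for m, w in neighbors[n]:
                if not models[m]:
                    continue  # neighbor has no estimate yet
                total += w
                for t, p in models[m].items():
                    acc[t] = acc.get(t, 0.0) + w * p
            updated[n] = {t: v / total for t, v in acc.items()} if total > 0 else {}
        models = updated
    return models


# The three example excerpts: two mention a year, the third does not.
nodes = ["album_2000", "festival_2007", "collaborator"]
seeds = {"album_2000": {2000: 1.0}, "festival_2007": {2007: 1.0}}
# Hypothetical similarity-based edge weights.
weights = {("album_2000", "collaborator"): 0.6, ("festival_2007", "collaborator"): 0.2}

result = propagate(nodes, weights, seeds)
print(result["collaborator"])
```

The excerpt without a temporal expression ends up with a mixture of its neighbors' time models, weighted by how similar it is to each of them.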
30. Outlook
• Extension of probabilistic generative model for time intervals to support different temporal granularities (day, month, year)
• Narrative generation for event digests to make them more natural for humans (e.g., chronological order, co-references)
• Index structures for documents with semantic (and linguistic) annotations from which all our methods can potentially profit
• Word embeddings for semantic annotations to have a uniform representation of text, temporal expressions, geographic references, and named entities and simplify our methods