Mining Historical Data for DBpedia 
via Temporal Tagging 
of Wikipedia Infoboxes 
Norman Weisenburger, Volha Bryl, Simone Paolo Ponzetto 
Data and Web Science Research Group 
University of Mannheim 
Germany 
NLP & DBpedia @ ISWC, Riva del Garda, Italy, October 20, 2014
Outline 
1. State of the art: Temporally annotated data in DBpedia and LOD 
2. Temporally annotated data extraction pipeline 
3. Company Dataset 
   - Statistics 
   - Comparison with other KBs 
4. Ongoing and future work 
Why we need historical LOD 
- Historical data == any data that is or can be temporally annotated 
  - population of a city, revenue of a company, current club for a football player 
- Why we need such data 
  - Allows a more precise description of an entity 
  - Enables LOD-based data mining for trend prediction 
- Availability of temporally annotated data on the Web of Data 
  - Poor and scarce 
  - Examples can be found in Freebase, Wikidata, YAGO, ... 
  - Temporally annotated facts or, not so frequently, time series 
  - Some exceptionally good examples follow 
Temporally annotated data: Examples 
Apple Inc. in Wikidata 
http://www.wikidata.org/wiki/Q312 
Temporally annotated data: Examples 
Apple Inc. in Freebase 
http://www.freebase.com/m/0k8z 
Temporally annotated data in DBpedia 
- DBpedia's main source of knowledge is Wikipedia infoboxes 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
  3. Temporally annotated, annotation is a part of an attribute value 
     - Often only the latest value is present 
     - When a new value is available, the old one is overwritten 
Temporally annotated data in DBpedia 
- DBpedia's main source of knowledge is Wikipedia infoboxes 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
  3. Temporally annotated, annotation is a part of an attribute value 
     - Often only the latest value is present 
     - When a new value is available, the old one is overwritten 
- Our focus: case 3, where the temporal annotation is part of the attribute value 
Temporally annotated data in DBpedia 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
[Infobox screenshot illustrating case (1)] 
Temporally annotated data in DBpedia 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
     - Often lost during DBpedia data extraction 
     - E.g. no connection between the populationTotal and populationAsOf properties 
[Infobox screenshot illustrating case (2)] 
Temporally annotated data in DBpedia 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
     - Ends up in DBpedia only if an intermediate node mapping is defined in the mappings wiki 
[Infobox screenshots illustrating case (2)] 
Temporally annotated data in DBpedia 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
  3. Temporally annotated, annotation is a part of an attribute value 
     - The annotation is lost during extraction 
     - In most cases the value is regularly overwritten 
Idea: go back in time 
- Properties of interest 
  - Temporally annotated, annotation is a part of an attribute value 
  - Use case: Business and Financial Data (Companies) 
- Key observations 
  - Attribute values are often temporally annotated 
  - If the annotation is part of the attribute value, the DBpedia extraction framework ignores it 
  - Attribute values are regularly overwritten by Wikipedia editors, but the trace remains in the Wikipedia revision history 
  - The DBpedia data extraction process is run on one (e.g. the latest) dump only 
- Proposed solution 
  - Run the extraction on (part of) the revision history 
  - Add a temporal tagger to the process 
Extraction pipeline 
1. Select and download Wikipedia revisions 
2. Extract temporal facts 
3. Merge facts 
- Code available at https://github.com/normalerweise/mte 
Extraction pipeline 
1. Select and download Wikipedia revisions 
- Select 4 revisions per year (the 1st, 2nd and 3rd quartile revisions, plus the last revision of the year) 
- Use the MediaWiki API to download the revisions (a minimal code sketch follows) 
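
A minimal sketch of this step, assuming the English Wikipedia API endpoint; the helper names and the exact quartile computation are illustrative and not taken from the mte code.

# Sketch of step 1: list a page's revisions via the MediaWiki API and keep
# four per year (1st, 2nd, 3rd quartile and the last revision of the year).
import requests
from collections import defaultdict

API = "https://en.wikipedia.org/w/api.php"

def fetch_revisions(title):
    """Return [(rev_id, timestamp)] for all revisions of a page, oldest first."""
    revisions = []
    params = {
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "ids|timestamp", "rvlimit": "max", "rvdir": "newer",
        "format": "json", "formatversion": "2",
    }
    while True:
        data = requests.get(API, params=params).json()
        page = data["query"]["pages"][0]
        revisions += [(r["revid"], r["timestamp"]) for r in page.get("revisions", [])]
        if "continue" not in data:
            return revisions
        params.update(data["continue"])  # standard MediaWiki continuation

def select_quarterly(revisions):
    """Keep roughly the 1st, 2nd, 3rd quartile and the last revision of each year."""
    by_year = defaultdict(list)
    for rev_id, timestamp in revisions:
        by_year[timestamp[:4]].append(rev_id)  # timestamps are ISO 8601, year first
    selected = []
    for year in sorted(by_year):
        revs = by_year[year]
        n = len(revs)
        picks = sorted({max(n // 4 - 1, 0), max(n // 2 - 1, 0), max(3 * n // 4 - 1, 0), n - 1})
        selected += [revs[i] for i in picks]
    return selected

revision_ids = select_quarterly(fetch_revisions("Netflix"))

The wikitext of the selected revisions can then be fetched with a second query that passes the chosen IDs via the revids parameter and requests rvprop=content.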
Extraction pipeline 
2. Extract temporal facts 
- Parse each infobox attribute twice 
  - For the value: Mapping Extractor of the DBpedia Extraction Framework 
  - For the time validity (point or interval): HeidelTime 
- HeidelTime is a multilingual, cross-domain, rule-based temporal tagger 
  - Developed at the University of Heidelberg 
  - http://dbs.ifi.uni-heidelberg.de/index.php?id=129 
Extraction pipeline 
2. Extract temporal facts 
   - Parse each infobox attribute twice 
     - For the value: Mapping Extractor of the DBpedia Extraction Framework 
     - For the time validity (point or interval): HeidelTime 
   - HeidelTime is a multilingual, cross-domain, rule-based temporal tagger 
     - Developed at the University of Heidelberg 
     - http://dbs.ifi.uni-heidelberg.de/index.php?id=129 
   - Example: infobox markup (input) and the extracted temporal fact (output; the last field is the revision ID), followed by a simplified code sketch 

{{ Infobox company 
| name = Netflix, Inc. 
| revenue = US$4.37 billion (''FY 2013'') 
... 

<Netflix, revenue, 4.37E9, usDollar, 2013, 610604061> 
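
A simplified sketch of this step for the Netflix example above. In the real pipeline the value is parsed by the DBpedia Mapping Extractor and the temporal expression by HeidelTime; here two plain regexes stand in for both, and the currency scaling table is an assumption.

# Simplified sketch of step 2: turn one infobox attribute into a temporal fact.
import re

SCALE = {"million": 1e6, "billion": 1e9}  # assumed scaling for US$ amounts

def extract_temporal_fact(subject, attribute, raw_value, revision_id):
    """Turn one infobox attribute value into a (s, p, value, unit, year, revision) fact."""
    # Value and unit, e.g. "US$4.37 billion" -> 4.37e9, "usDollar"
    m = re.search(r"US\$\s*([\d.,]+)\s*(million|billion)?", raw_value)
    if m is None:
        return None
    value = float(m.group(1).replace(",", "")) * SCALE.get(m.group(2), 1)
    # Temporal validity, e.g. "(''FY 2013'')" -> "2013"; HeidelTime's job in the real pipeline
    y = re.search(r"(?:19|20)\d{2}", raw_value)
    if y is None:
        return None
    return (subject, attribute, value, "usDollar", y.group(0), revision_id)

fact = extract_temporal_fact("Netflix", "revenue", "US$4.37 billion (''FY 2013'')", 610604061)
print(fact)  # ('Netflix', 'revenue', 4370000000.0, 'usDollar', '2013', 610604061)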
Extraction pipeline 
3. Merge facts 
- Group triples by subject, property, temporal validity, value 
- In case of value conflicts, select the most frequent value 
- In case of ties, select the most recent value 
Extraction pipeline 
3. Merge facts 
   - Group triples by subject, property, temporal validity, value 
   - In case of value conflicts, select the most frequent value 
   - In case of ties, select the most recent value 
   - Example (a code sketch of this step follows below): 

Extracted facts (input): 
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234> 
<Netflix, operatingIncome, 1.92E8, usDollar, 2009, 387048342> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 426138580> 

Merged facts (output): 
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478> 
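
A sketch of the merge step under the rules listed above. Facts are the six-field tuples shown on this slide; using the revision ID as a recency proxy is an assumption.

# Sketch of step 3: merge temporal facts extracted from different revisions.
from collections import Counter, defaultdict

def merge_facts(facts):
    groups = defaultdict(list)
    for s, p, v, u, year, rev in facts:
        groups[(s, p, year)].append((v, u, rev))  # group by subject, property, temporal validity
    merged = []
    for (s, p, year), candidates in groups.items():
        counts = Counter(v for v, _, _ in candidates)
        top = counts.most_common(1)[0][1]
        # Among the most frequent values, keep the most recently observed one
        # (here: the highest revision id, used as a recency proxy).
        v, u, rev = max((c for c in candidates if counts[c[0]] == top), key=lambda c: c[2])
        merged.append((s, p, v, u, year, rev))
    return merged

facts = [
    ("Netflix", "operatingIncome", 1.28e8, "usDollar", "2008", 352001234),
    ("Netflix", "operatingIncome", 1.92e8, "usDollar", "2009", 387048342),
    ("Netflix", "operatingIncome", 1.94e8, "usDollar", "2009", 439282478),
    ("Netflix", "operatingIncome", 1.94e8, "usDollar", "2009", 426138580),
]
for f in merge_facts(facts):
    print(f)  # reproduces the merged output shown above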
Data model 
- Our choice for the RDF representation: the singleton property approach 
  - Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. Don't like RDF reification? Making statements about statements using singleton property. WWW 2014 
  - Motivation: performance in terms of #triples, query size and execution time 
  - Main idea: a unique URI for each predicate instance, e.g. (an rdflib sketch follows the triples below): 
<Netflix, revenue#uniqueId, 4.37E9> 
<revenue#uniqueId, singletonPropertyOf, revenue> 
<revenue#uniqueId, date, 2013> 
<revenue#uniqueId, sourceRevision, 610604061> 
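
A short rdflib sketch of how such a singleton-property fact can be serialised; the namespaces and property URIs are illustrative assumptions, not necessarily the vocabulary used in the published dataset.

# Sketch: serialise one temporal fact with the singleton property approach.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")
EX = Namespace("http://example.org/temporal/")  # hypothetical dataset namespace

g = Graph()
netflix = DBR["Netflix"]
revenue_instance = EX["revenue#610604061"]  # unique URI for this predicate instance

g.add((netflix, revenue_instance, Literal(4.37e9, datatype=XSD.double)))
g.add((revenue_instance, EX.singletonPropertyOf, DBO.revenue))
g.add((revenue_instance, EX.date, Literal("2013", datatype=XSD.gYear)))
g.add((revenue_instance, EX.sourceRevision, Literal(610604061)))

print(g.serialize(format="turtle"))

Each fact gets its own predicate URI, so the temporal validity and source revision attach to that URI instead of to a reified statement, which keeps the number of triples per fact low.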
Company dataset 
- Dataset available at http://tiny.cc/tmpcompany 
- Started from DBpedia resources of type dbpedia-owl:Company and yago:Company108058098 
- 51,214 companies; for 18,489 of them at least one fact is extracted for 
  - assets 
  - equity 
  - netIncome 
  - numberOfEmployees 
  - operatingIncome 
  - revenue 
Company dataset 
- Dataset available at http://tiny.cc/tmpcompany 
- 51,214 companies; for 18,489 of them at least one fact is extracted for 
  assets, equity, netIncome, numberOfEmployees, operatingIncome, revenue 
Company dataset vs other KBs 
- 10 random companies with well-maintained infoboxes 
- Manually mapped ontology properties 
- YAGO2 
  - 0 triples for these companies for hasNumberOfPeople and hasRevenue 
[Table: temporal facts for the 10 companies in our dataset] 
Company dataset vs other KBs 
- 10 random companies with well-maintained infoboxes 
- Manually mapped ontology properties 
- YAGO2 
  - 0 triples for these companies for hasNumberOfPeople and hasRevenue 
- Freebase 
  - 201 vs 58 triples 
[Tables: temporal facts for the 10 companies in our dataset and in Freebase] 
Evaluation 
- Evaluating the precision (preliminary, not in the paper) 
  - 100 random tuples, 2 properties, so far only one annotator 
  - 75% for numberOfEmployees and 78% for revenue 
    - Errors are caused by parsing problems: the DBpedia extraction framework is tuned to work with the latest Wikipedia version 
  - After fixing some errors: 97% for numberOfEmployees and 92% for revenue 
Ongoing and future work 
- Ongoing: extracting missing attributes from Wikipedia article texts 
  - The company dataset is used for distant supervision 
- Anticipating some questions 
  - Yes, we tried the approach for another domain: American football 
  - Yes, making the data available through an endpoint is on our to-do list 


Editor's Notes

  • #15: Only the latest value is present in the infobox of interest?
  • #26: Comparison with DBpedia: http://dbpedia.org/resource/Apple_Inc. Our dataset contains 45 temporal facts, whereas DBpedia currently has one fact per relation, i.e. 6 triples
  • #27: Freebase lists EDGAR as one of its data sources. EDGAR is a database of information about publicly traded US companies, operated by the United States Securities and Exchange Commission.