Norman Weisenburger, Volha Bryl, Simone Paolo Ponzetto
NLP & DBpedia 2014 Workshop @ ISWC 2014, Riva del Garda, Italy, October 20, 2014
Mining Historical Data for DBpedia via Temporal Tagging of Wikipedia Infoboxes
1. Mining Historical Data for DBpedia
via Temporal Tagging
of Wikipedia Infoboxes
Norman Weisenburger, Volha Bryl, Simone Paolo Ponzetto
Data and Web Science Research Group
University of Mannheim
Germany
NLP & DBpedia @ ISWC, Riva del Garda, Italy, October 20, 2014
2. Outline
1. State of the art: Temporally annotated data in DBpedia and LOD
2. Temporally annotated data extraction pipeline
3. Company Dataset
Statistics
Comparison with other KBs
4. Ongoing and future work
Mining Historical Data for DBpedia, Weisenburger, Bryl, Ponzetto 2
3. Why we need historical LOD
Historical data == any data that is or can be temporally annotated
population of a city, revenue of a company, current club for a football player
Why we need such data
Allows having a more precise description of an entity
Enables LOD-based data mining for trend prediction
Availability of temporally annotated data on the Web of Data
Poor and scarce
Examples can be found in Freebase, Wikidata, YAGO
Temporally annotated facts or, less frequently, time series
Some exceptionally good examples follow
4. Temporally annotated data: Examples
Apple Inc. in Wikidata
http://www.wikidata.org/wiki/Q312
5. Temporally annotated data: Examples
Apple Inc. in Freebase
http://www.freebase.com/m/0k8z
6. Temporally annotated data in DBpedia
DBpedia's main source of knowledge is Wikipedia infoboxes
Temporal (time-dependent) infobox attributes
1. Not at all temporally annotated
2. Temporally annotated, annotation is modeled as a separate attribute
3. Temporally annotated, annotation is a part of an attribute value
Often only the latest value is present
When new value is available, the old one is overwritten
Our focus: case 3, temporal annotation is a part of an attribute value
8. Temporally annotated data in DBpedia
Temporal (time-dependent) infobox attributes
1. Not at all temporally annotated
9. Temporally annotated data in DBpedia
Temporal (time-dependent) infobox attributes
1. Not at all temporally annotated
2. Temporally annotated, annotation is modeled as a separate attribute
Often lost during DBpedia data extraction
E.g. no connection between populationTotal and populationAsOf properties
10. Temporally annotated data in DBpedia
Temporal (time-dependent) infobox attributes
1. Not at all temporally annotated
2. Temporally annotated, annotation is modeled as a separate attribute
Ends up in DBpedia only if an intermediate node mapping is defined in the mapping wiki
12. Temporally annotated data in DBpedia
Temporal (time-dependent) infobox attributes
1. Not at all temporally annotated
2. Temporally annotated, annotation is modeled as a separate attribute
3. Temporally annotated, annotation is a part of an attribute value
Annotation is lost during extraction
In most cases the value is regularly overwritten
13. Idea: go back in time
Properties of interest
Temporally annotated, annotation is a part of an attribute value
Use case: Business and Financial Data (Companies)
Key observations
Attribute values are often temporally annotated
If the annotation is part of the attribute value, the DBpedia extraction framework ignores it
Attribute values are regularly overwritten by Wikipedia editors, but the trace remains in the Wikipedia revision history
DBpedia data extraction process is run on one (e.g. the latest) dump only
Proposed solution
Run extraction on (part of) revision history
Add a temporal tagger to the process
14. Extraction pipeline
1. Select and download Wikipedia revisions
2. Extract temporal facts
3. Merge facts
Code available at https://github.com/normalerweise/mte
15. Extraction pipeline
1. Select and download Wikipedia revisions
Select 4 revisions per year (1st, 2nd, 3rd quartile and the last revision)
Use MediaWiki API to download the revisions
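A minimal sketch of the selection step, assuming "quartile" refers to positions in each year's time-ordered revision list (the slides do not say whether quartiles are taken over revision counts or over calendar time). The function name and the input format are hypothetical, and downloading via the MediaWiki API is left out.

```python
from collections import defaultdict


def select_revisions(revisions):
    """Pick up to 4 revisions per year: 1st, 2nd, 3rd quartile and the last.

    revisions: list of (revision_id, iso_timestamp) tuples in any order.
    Returns a dict mapping year -> list of selected revision ids.
    """
    by_year = defaultdict(list)
    for rev_id, timestamp in revisions:
        # ISO 8601 timestamps start with the four-digit year.
        by_year[int(timestamp[:4])].append((timestamp, rev_id))

    selected = {}
    for year, revs in by_year.items():
        revs.sort()  # ISO timestamps in one format sort chronologically
        n = len(revs)
        # Assumption: quartiles are positions in the ordered revision list.
        positions = {(n - 1) * q // 4 for q in (1, 2, 3)} | {n - 1}
        selected[year] = [revs[i][1] for i in sorted(positions)]
    return selected
```

Years with fewer than four revisions simply yield fewer selected revisions, since duplicate positions collapse.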
16. Extraction pipeline
2. Extract temporal facts
Parse each infobox attribute twice
For a value: Mapping Extractor of the DBpedia Extraction Framework
For time validity (point or interval): HeidelTime
HeidelTime is a multilingual cross-domain rule-based temporal tagger
Developed at the University Of Heidelberg
http://dbs.ifi.uni-heidelberg.de/index.php?id=129
17. Extraction pipeline
{{ Infobox company
| name = Netflix, Inc.
| revenue = US$4.37 billion (''FY 2013'')
...
<Netflix, revenue, 4.37E9, usDollar, 2013, 610604061>
Revision ID: the last element of the tuple above (610604061)
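The "parse twice" idea can be sketched as follows. This is not the actual pipeline: two simple regular expressions stand in for the DBpedia Mapping Extractor (value) and for HeidelTime (time validity), and they only handle US dollar amounts and four-digit years. All names are hypothetical.

```python
import re

# Stand-in for the value parser (the real pipeline uses the DBpedia Mapping Extractor).
VALUE_RE = re.compile(
    r"US\$\s*(\d[\d,]*(?:\.\d+)?)\s*(thousand|million|billion)?", re.IGNORECASE
)
# Stand-in for the temporal tagger (the real pipeline uses HeidelTime).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")

SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def extract_fact(subject, prop, raw_value, revision_id):
    """Parse one infobox attribute value twice: once for the value, once for the year.

    Returns (subject, property, value, unit, year, revision_id), or None
    if either the value or the temporal annotation cannot be found.
    """
    value_match = VALUE_RE.search(raw_value)
    year_match = YEAR_RE.search(raw_value)
    if value_match is None or year_match is None:
        return None
    number = float(value_match.group(1).replace(",", ""))
    scale = SCALES.get((value_match.group(2) or "").lower(), 1.0)
    return (subject, prop, number * scale, "usDollar",
            int(year_match.group(0)), revision_id)


fact = extract_fact("Netflix", "revenue", "US$4.37 billion (''FY 2013'')", 610604061)
print(fact)
```

A rule-based tagger such as HeidelTime also covers intervals and many date formats, which this sketch ignores.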
18. Extraction pipeline
3. Merge facts
Group triples by subject, property, temporal validity, value
In case of value conflicts, select the most frequent value
In case of ties, select the most recent value
19. Extraction pipeline
Extracted facts:
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234>
<Netflix, operatingIncome, 1.92E8, usDollar, 2009, 387048342>
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478>
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 426138580>
Merged facts:
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234>
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478>
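The merge rules can be sketched as below, using the Netflix operatingIncome tuples from the slide. The function name is hypothetical, and "most recent" is approximated by the largest revision ID, which is an assumption about how recency is determined.

```python
from collections import Counter, defaultdict


def merge_facts(facts):
    """Merge facts of the form (subject, property, value, unit, validity, revision_id).

    For each (subject, property, validity) the most frequent value wins;
    ties are broken in favour of the most recent revision.
    """
    groups = defaultdict(list)
    for subject, prop, value, unit, validity, revision_id in facts:
        groups[(subject, prop, validity)].append((value, unit, revision_id))

    merged = []
    for (subject, prop, validity), candidates in groups.items():
        counts = Counter(value for value, _, _ in candidates)
        # Rank by value frequency first, then by revision ID
        # (assumed to grow over time).
        value, unit, revision_id = max(
            candidates, key=lambda c: (counts[c[0]], c[2])
        )
        merged.append((subject, prop, value, unit, validity, revision_id))
    return merged


facts = [
    ("Netflix", "operatingIncome", 1.28e8, "usDollar", 2008, 352001234),
    ("Netflix", "operatingIncome", 1.92e8, "usDollar", 2009, 387048342),
    ("Netflix", "operatingIncome", 1.94e8, "usDollar", 2009, 439282478),
    ("Netflix", "operatingIncome", 1.94e8, "usDollar", 2009, 426138580),
]
for fact in merge_facts(facts):
    print(fact)
```

For 2009 the value 1.94E8 occurs twice and 1.92E8 once, so 1.94E8 is kept together with the later of its two revisions.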
20. Data model
Our choice for RDF representation
Singleton property approach
Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. Don't like RDF reification?
Making statements about statements using singleton property, WWW 2014
Motivation: performance in terms of #triples, query size and execution time
Main idea: unique URI for each predicate instance
<Netflix, revenue#uniqueId, 4.37E9>
<revenue#uniqueId, singletonPropertyOf, revenue>
<revenue#uniqueId, date, 2013>
<revenue#uniqueId, sourceRevision, 610604061>
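As an illustration of the pattern above, one temporal fact expands into four triples. The sketch uses plain Python tuples instead of an RDF library; the function name and the way the unique ID is supplied are hypothetical.

```python
def to_singleton_triples(subject, prop, value, validity, revision_id, unique_id):
    """Expand one temporal fact into singleton-property style triples.

    unique_id must be distinct for every fact so that each predicate
    instance gets its own identifier.
    """
    singleton = f"{prop}#{unique_id}"
    return [
        (subject, singleton, value),                 # the fact itself
        (singleton, "singletonPropertyOf", prop),    # link to the generic property
        (singleton, "date", validity),               # temporal validity
        (singleton, "sourceRevision", revision_id),  # provenance
    ]


for triple in to_singleton_triples("Netflix", "revenue", 4.37e9, 2013, 610604061, "1"):
    print(triple)
```

Metadata about the fact (date, source revision) attaches to the singleton property itself, so no reification node is needed.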
21. Company dataset
Dataset available at http://tiny.cc/tmpcompany
Started from DBpedia resources of type dbpedia-owl:Company and
yago:Company108058098
51,214 companies; for 18,489 of them at least one fact is extracted for the properties
assets
equity
netIncome
numberOfEmployees
operatingIncome
revenue
22. Company dataset
Dataset available at http://tiny.cc/tmpcompany
51,214 companies; for 18,489 of them at least one fact is extracted for the properties
assets, equity, netIncome, numberOfEmployees, operatingIncome, revenue
24. Company dataset vs other KBs
10 random companies with well-maintained infoboxes
Manually mapped ontology properties
YAGO2: 0 triples for these companies for hasNumberOfPeople and hasRevenue
Freebase: 201 vs 58 triples
[Chart: our dataset vs Freebase]
26. Evaluation
Evaluating the precision
(preliminary, not in the paper)
100 random tuples, 2 properties, so far only one annotator
Precision: 75% for numberOfEmployees and 78% for revenue
Errors are caused by parsing: the DBpedia extraction framework is always tuned to work with the latest Wikipedia version
After fixing some errors: 97% for numberOfEmployees and 92% for revenue
27. Ongoing and future work
Ongoing: extracting missing attributes from Wikipedia article texts
Company dataset is used for distant supervision
Anticipating some questions
Yes, we tried the approach for another domain: American football
Yes, making the data available through an endpoint is on our todo list
Editor's Notes
#15: Only the latest value is present in the infobox of interest?
#26: Comparison with DBpedia:
http://dbpedia.org/resource/Apple_Inc.
Our dataset contains 45 temporal facts, whereas DBpedia currently has one fact for each relation, i.e. 6 triples
#27: Freebase lists EDGAR as one of its data sources. EDGAR is a database operated by the United States Securities and Exchange Commission that contains information about publicly traded US companies.