Mining Historical Data for DBpedia 
via Temporal Tagging 
of Wikipedia Infoboxes 
Norman Weisenburger, Volha Bryl, Simone Paolo Ponzetto 
Data and Web Science Research Group 
University of Mannheim 
Germany 
NLP & DBpedia @ ISWC, Riva del Garda, Italy, October 20, 2014
Outline 
1. State of the art: Temporally annotated data in DBpedia and LOD 
2. Temporally annotated data extraction pipeline 
3. Company Dataset 
   - Statistics 
   - Comparison with other KBs 
4. Ongoing and future work 
Why we need historical LOD 
- Historical data == any data that is or can be temporally annotated 
  - population of a city, revenue of a company, current club for a football player 
- Why we need such data 
  - Allows a more precise description of an entity 
  - Enables LOD-based data mining for trend prediction 
- Availability of temporally annotated data on the Web of Data 
  - Poor and scarce 
  - Examples can be found in Freebase, Wikidata, YAGO, ... 
  - Temporally annotated facts or, not so frequently, time series 
  - Some exceptionally good examples follow 
Temporally annotated data: Examples 
Apple Inc. in Wikidata 
http://www.wikidata.org/wiki/Q312 
Temporally annotated data: Examples 
Apple Inc. in Freebase 
http://www.freebase.com/m/0k8z 
Temporally annotated data in DBpedia 
- DBpedia's main source of knowledge is Wikipedia infoboxes 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
  3. Temporally annotated, annotation is a part of an attribute value 
     - Often only the latest value is present 
     - When a new value is available, the old one is overwritten 
Temporally annotated data in DBpedia 
- DBpedia's main source of knowledge is Wikipedia infoboxes 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
  3. Temporally annotated, annotation is a part of an attribute value 
     - Often only the latest value is present 
     - When a new value is available, the old one is overwritten 
- Our focus: case 3, where the temporal annotation is part of the attribute value 
Temporally annotated data in DBpedia 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
[Infobox screenshot illustrating case (1)] 
Temporally annotated data in DBpedia 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
     - Often lost during DBpedia data extraction 
     - E.g. no connection between the populationTotal and populationAsOf properties 
[Infobox screenshot illustrating case (2)] 
Temporally annotated data in DBpedia 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
     - Ends up in DBpedia only if an intermediate node mapping is defined in the mappings wiki 
[Infobox screenshots illustrating case (2)] 
Temporally annotated data in DBpedia 
- Temporal (time-dependent) infobox attributes 
  1. Not at all temporally annotated 
  2. Temporally annotated, annotation is modeled as a separate attribute 
  3. Temporally annotated, annotation is a part of an attribute value 
     - The annotation is lost during extraction 
     - In most cases the value is regularly overwritten 
Idea: go back in time 
- Properties of interest 
  - Temporally annotated, annotation is a part of an attribute value 
  - Use case: Business and Financial Data (Companies) 
- Key observations 
  - Attribute values are often temporally annotated 
  - If the annotation is part of the attribute value, the DBpedia extraction framework ignores it 
  - Attribute values are regularly overwritten by Wikipedia editors, but the trace remains in the Wikipedia revision history 
  - The DBpedia data extraction process is run on one (e.g. the latest) dump only 
- Proposed solution 
  - Run the extraction on (part of) the revision history 
  - Add a temporal tagger to the process 
Extraction pipeline 
1. Select and download Wikipedia revisions 
2. Extract temporal facts 
3. Merge facts 
- Code available at https://github.com/normalerweise/mte 
Extraction pipeline 
1. Select and download Wikipedia revisions 
- Select 4 revisions per year (the 1st, 2nd and 3rd quartile revisions, plus the last revision of the year) 
- Use the MediaWiki API to download the revisions (a minimal code sketch follows) 
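
A minimal sketch of this step, assuming the English Wikipedia API endpoint; the helper names and the exact quartile computation are illustrative and not taken from the mte code.

# Sketch of step 1: list a page's revisions via the MediaWiki API and keep
# four per year (1st, 2nd, 3rd quartile and the last revision of the year).
import requests
from collections import defaultdict

API = "https://en.wikipedia.org/w/api.php"

def fetch_revisions(title):
    """Return [(rev_id, timestamp)] for all revisions of a page, oldest first."""
    revisions = []
    params = {
        "action": "query", "prop": "revisions", "titles": title,
        "rvprop": "ids|timestamp", "rvlimit": "max", "rvdir": "newer",
        "format": "json", "formatversion": "2",
    }
    while True:
        data = requests.get(API, params=params).json()
        page = data["query"]["pages"][0]
        revisions += [(r["revid"], r["timestamp"]) for r in page.get("revisions", [])]
        if "continue" not in data:
            return revisions
        params.update(data["continue"])  # standard MediaWiki continuation

def select_quarterly(revisions):
    """Keep roughly the 1st, 2nd, 3rd quartile and the last revision of each year."""
    by_year = defaultdict(list)
    for rev_id, timestamp in revisions:
        by_year[timestamp[:4]].append(rev_id)  # timestamps are ISO 8601, year first
    selected = []
    for year in sorted(by_year):
        revs = by_year[year]
        n = len(revs)
        picks = sorted({max(n // 4 - 1, 0), max(n // 2 - 1, 0), max(3 * n // 4 - 1, 0), n - 1})
        selected += [revs[i] for i in picks]
    return selected

revision_ids = select_quarterly(fetch_revisions("Netflix"))

The wikitext of the selected revisions can then be fetched with a second query that passes the chosen IDs via the revids parameter and requests rvprop=content.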
Extraction pipeline 
2. Extract temporal facts 
- Parse each infobox attribute twice 
  - For the value: Mapping Extractor of the DBpedia Extraction Framework 
  - For the time validity (point or interval): HeidelTime 
- HeidelTime is a multilingual, cross-domain, rule-based temporal tagger 
  - Developed at the University of Heidelberg 
  - http://dbs.ifi.uni-heidelberg.de/index.php?id=129 
Extraction pipeline 
2. Extract temporal facts 
   - Parse each infobox attribute twice 
     - For the value: Mapping Extractor of the DBpedia Extraction Framework 
     - For the time validity (point or interval): HeidelTime 
   - HeidelTime is a multilingual, cross-domain, rule-based temporal tagger 
     - Developed at the University of Heidelberg 
     - http://dbs.ifi.uni-heidelberg.de/index.php?id=129 
   - Example: infobox markup (input) and the extracted temporal fact (output; the last field is the revision ID), followed by a simplified code sketch 

{{ Infobox company 
| name = Netflix, Inc. 
| revenue = US$4.37 billion (''FY 2013'') 
... 

<Netflix, revenue, 4.37E9, usDollar, 2013, 610604061> 
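
A simplified sketch of this step for the Netflix example above. In the real pipeline the value is parsed by the DBpedia Mapping Extractor and the temporal expression by HeidelTime; here two plain regexes stand in for both, and the currency scaling table is an assumption.

# Simplified sketch of step 2: turn one infobox attribute into a temporal fact.
import re

SCALE = {"million": 1e6, "billion": 1e9}  # assumed scaling for US$ amounts

def extract_temporal_fact(subject, attribute, raw_value, revision_id):
    """Turn one infobox attribute value into a (s, p, value, unit, year, revision) fact."""
    # Value and unit, e.g. "US$4.37 billion" -> 4.37e9, "usDollar"
    m = re.search(r"US\$\s*([\d.,]+)\s*(million|billion)?", raw_value)
    if m is None:
        return None
    value = float(m.group(1).replace(",", "")) * SCALE.get(m.group(2), 1)
    # Temporal validity, e.g. "(''FY 2013'')" -> "2013"; HeidelTime's job in the real pipeline
    y = re.search(r"(?:19|20)\d{2}", raw_value)
    if y is None:
        return None
    return (subject, attribute, value, "usDollar", y.group(0), revision_id)

fact = extract_temporal_fact("Netflix", "revenue", "US$4.37 billion (''FY 2013'')", 610604061)
print(fact)  # ('Netflix', 'revenue', 4370000000.0, 'usDollar', '2013', 610604061)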
Extraction pipeline 
3. Merge facts 
- Group triples by subject, property, temporal validity, value 
- In case of value conflicts, select the most frequent value 
- In case of ties, select the most recent value 
Extraction pipeline 
3. Merge facts 
   - Group triples by subject, property, temporal validity, value 
   - In case of value conflicts, select the most frequent value 
   - In case of ties, select the most recent value 
   - Example (a code sketch of this step follows below): 

Extracted facts (input): 
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234> 
<Netflix, operatingIncome, 1.92E8, usDollar, 2009, 387048342> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 426138580> 

Merged facts (output): 
<Netflix, operatingIncome, 1.28E8, usDollar, 2008, 352001234> 
<Netflix, operatingIncome, 1.94E8, usDollar, 2009, 439282478> 
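
A sketch of the merge step under the rules listed above. Facts are the six-field tuples shown on this slide; using the revision ID as a recency proxy is an assumption.

# Sketch of step 3: merge temporal facts extracted from different revisions.
from collections import Counter, defaultdict

def merge_facts(facts):
    groups = defaultdict(list)
    for s, p, v, u, year, rev in facts:
        groups[(s, p, year)].append((v, u, rev))  # group by subject, property, temporal validity
    merged = []
    for (s, p, year), candidates in groups.items():
        counts = Counter(v for v, _, _ in candidates)
        top = counts.most_common(1)[0][1]
        # Among the most frequent values, keep the most recently observed one
        # (here: the highest revision id, used as a recency proxy).
        v, u, rev = max((c for c in candidates if counts[c[0]] == top), key=lambda c: c[2])
        merged.append((s, p, v, u, year, rev))
    return merged

facts = [
    ("Netflix", "operatingIncome", 1.28e8, "usDollar", "2008", 352001234),
    ("Netflix", "operatingIncome", 1.92e8, "usDollar", "2009", 387048342),
    ("Netflix", "operatingIncome", 1.94e8, "usDollar", "2009", 439282478),
    ("Netflix", "operatingIncome", 1.94e8, "usDollar", "2009", 426138580),
]
for f in merge_facts(facts):
    print(f)  # reproduces the merged output shown above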
Data model 
- Our choice for the RDF representation: the singleton property approach 
  - Vinh Nguyen, Olivier Bodenreider, and Amit Sheth. Don't like RDF reification? Making statements about statements using singleton property. WWW 2014 
  - Motivation: performance in terms of #triples, query size and execution time 
  - Main idea: a unique URI for each predicate instance, e.g. (an rdflib sketch follows the triples below): 
<Netflix, revenue#uniqueId, 4.37E9> 
<revenue#uniqueId, singletonPropertyOf, revenue> 
<revenue#uniqueId, date, 2013> 
<revenue#uniqueId, sourceRevision, 610604061> 
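
A short rdflib sketch of how such a singleton-property fact can be serialised; the namespaces and property URIs are illustrative assumptions, not necessarily the vocabulary used in the published dataset.

# Sketch: serialise one temporal fact with the singleton property approach.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

DBR = Namespace("http://dbpedia.org/resource/")
DBO = Namespace("http://dbpedia.org/ontology/")
EX = Namespace("http://example.org/temporal/")  # hypothetical dataset namespace

g = Graph()
netflix = DBR["Netflix"]
revenue_instance = EX["revenue#610604061"]  # unique URI for this predicate instance

g.add((netflix, revenue_instance, Literal(4.37e9, datatype=XSD.double)))
g.add((revenue_instance, EX.singletonPropertyOf, DBO.revenue))
g.add((revenue_instance, EX.date, Literal("2013", datatype=XSD.gYear)))
g.add((revenue_instance, EX.sourceRevision, Literal(610604061)))

print(g.serialize(format="turtle"))

Each fact gets its own predicate URI, so the temporal validity and source revision attach to that URI instead of to a reified statement, which keeps the number of triples per fact low.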
Company dataset 
- Dataset available at http://tiny.cc/tmpcompany 
- Started from DBpedia resources of type dbpedia-owl:Company and yago:Company108058098 
- 51,214 companies; for 18,489 of them at least one fact is extracted for 
  - assets 
  - equity 
  - netIncome 
  - numberOfEmployees 
  - operatingIncome 
  - revenue 
Company dataset 
- Dataset available at http://tiny.cc/tmpcompany 
- 51,214 companies; for 18,489 of them at least one fact is extracted for 
  assets, equity, netIncome, numberOfEmployees, operatingIncome, revenue 
Company dataset vs other KBs 
- 10 random companies with well-maintained infoboxes 
- Manually mapped ontology properties 
- YAGO2 
  - 0 triples for these companies for hasNumberOfPeople and hasRevenue 
[Table: temporal facts for the 10 companies in our dataset] 
Company dataset vs other KBs 
- 10 random companies with well-maintained infoboxes 
- Manually mapped ontology properties 
- YAGO2 
  - 0 triples for these companies for hasNumberOfPeople and hasRevenue 
- Freebase 
  - 201 vs 58 triples 
[Tables: temporal facts for the 10 companies in our dataset and in Freebase] 
Evaluation 
- Evaluating the precision (preliminary, not in the paper) 
  - 100 random tuples, 2 properties, so far only one annotator 
  - 75% for numberOfEmployees and 78% for revenue 
    - Errors are caused by parsing problems: the DBpedia extraction framework is tuned to work with the latest Wikipedia version 
  - After fixing some errors: 97% for numberOfEmployees and 92% for revenue 
Ongoing and future work 
- Ongoing: extracting missing attributes from Wikipedia article texts 
  - The company dataset is used for distant supervision 
- Anticipating some questions 
  - Yes, we tried the approach for another domain: American football 
  - Yes, making the data available through an endpoint is on our to-do list 


Editor's Notes

  • #15: Only the latest value is present in the infobox of interest?
  • #26: Comparison with DBpedia: http://dbpedia.org/resource/Apple_Inc. Our dataset contains 45 temporal facts, whereas DBpedia currently has one fact per relation, i.e. 6 triples
  • #27: Freebase lists EDGAR as one of its data sources. EDGAR is a database of information about publicly traded US companies, operated by the United States Securities and Exchange Commission.