DBpedia 2014 : Highlights and Issues of the New Release
Volha Bryl 
Data and Web Science Research Group 
University of Mannheim, Germany 
DBpedia Community Meeting, Leipzig, Germany, September 3, 2014
DBpedia 2014 : Almost Released 
http://dbpedia.org/page/Rome/London 
http://dbpedia.org/sparql/ 
http://dbpedia.org/fct/ 
http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/ 
http://wiki.dbpedia.org/Datasets2014/DatasetStatistics 
DBpedia 2014, Volha Bryl 2
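A minimal sketch of how the SPARQL endpoint listed above could be queried from Python. It assumes the endpoint accepts the standard SPARQL protocol `query` parameter over HTTP GET and honours an `Accept` header for JSON results. The helper names (`build_query_url`, `run_query`) and the example query are illustrative, not part of the release.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql/"  # endpoint URL taken from the slide


def build_query_url(sparql, endpoint=ENDPOINT):
    """Encode a SPARQL query as a GET URL using the protocol's `query` parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": sparql})


def run_query(sparql, endpoint=ENDPOINT):
    """Send the query and decode the JSON result (needs network access)."""
    request = urllib.request.Request(
        build_query_url(sparql, endpoint),
        # Assumption: the server supports this standard result media type.
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Illustrative query: ten arbitrary statements about the resource for Rome.
ROME_QUERY = """
SELECT ?property ?value
WHERE { <http://dbpedia.org/resource/Rome> ?property ?value }
LIMIT 10
"""
```

Calling `run_query(ROME_QUERY)` would return the parsed result document; `build_query_url` alone can be used to inspect the request without touching the network.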
DBpedia 2014 : Almost Released 
• Why “DBpedia 2014”?
  • Suggested at one of the developers’ hangouts
  • 3.10 would be confusing
  • 4.0 would mean “there are major changes/improvements”
DBpedia 2014 : Team 
• Research Group Data and Web Science, University of Mannheim
  • Daniel Fleischhacker *
  • Michael Moore * (intern from Uni Waterloo)
  • Volha Bryl *
  • Christian Bizer
  * funded by the LOD2 project
• With the support of
  • Dimitris Kontokostas
  • Jona Christopher Sahnwaldt
  • Kingsley Idehen, Patrick van Kleef, Mitko Iliev (OpenLink Software)
  • Heiko Paulheim, Petar Ristoski (Uni Mannheim)
  • …the whole DBpedia community…
DBpedia 2014 : Facts and Numbers 
• Dumps from April / May 2014
  • 3.9 was based on dumps from March / April 2013
• Improved mappings
  • http://mappings.dbpedia.org/, mid July 2014
  • 4,339 mappings (3.9: 3,177 mappings)
• Enlarged Ontology
  • 685 classes (3.9: 529)
  • 1,079 object and 1,600 datatype properties (3.9: 927 and 1,290)
  • More mappings to schema.org, Wikidata, …
  • Mappings to DOLCE ontology, by Aldo Gangemi
DBpedia 2014 : Facts and Numbers 
• 125 languages (3.9: 119)
  • 10,000+ articles
• New mapping-based chapters and data for Belarusian (be), Serbian (sr), Welsh (cy), Slovak (sk)
• Wikimedia Commons extraction
• New extractors/datasets
  • Length of article page (page-length)
  • Number of outgoing links (out-degree)
  • Anchor texts used by links referring to an entity (anchor-text)
    • Should we publish them?
  • Surface forms (anchor-texts + redirect labels + labels) (surface-form)
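The surface-form dataset is described above as the combination of anchor texts, redirect labels and labels. A minimal sketch of that combination for a single entity, assuming each source is available as a plain list of strings; the function name and the sample strings are hypothetical and not taken from the extraction framework.

```python
def surface_forms(anchor_texts, redirect_labels, labels):
    """Merge the three label sources into one deduplicated, sorted list."""
    return sorted(set(anchor_texts) | set(redirect_labels) | set(labels))


# Hypothetical inputs for one entity; real data would come from the dumps.
example = surface_forms(
    anchor_texts=["Rome", "the Eternal City"],
    redirect_labels=["Roma"],
    labels=["Rome"],
)
```

Using a set union keeps each surface form once even when the same string appears as both a label and an anchor text.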
DBpedia 2014 : Facts and Numbers (code)
• New abstract extraction approach; abstracts are much cleaner now
  • Local Wikipedia copy + using the MediaWiki API for parsing
• Canonicalized (-en-uris) dumps based on Wikidata language links
  • Based on newly introduced Wikidata extractors
  • old-interlanguage-links dumps contain leftover language links directly contained in Wikipedia (ignored for canonicalization)
• Support for RDF 1.1
• Improved handling of dates and times
• Improved handling of external URLs
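To illustrate what canonicalization means here, the sketch below rewrites localized URIs to their English counterparts using a lookup table of language links. It is an assumption-laden illustration, not the framework's code: the function name, the tuple representation of triples, the choice to drop triples whose subject has no English counterpart, and the sample URIs are all hypothetical.

```python
def canonicalize(triples, language_links):
    """Rewrite localized URIs to English URIs where a language link is known.

    `triples` is an iterable of (subject, predicate, object) tuples and
    `language_links` maps a localized URI to its English URI.
    """
    canonical = []
    for subject, predicate, obj in triples:
        if subject not in language_links:
            # Assumption: statements about unlinked resources are left out.
            continue
        # Objects are rewritten only when they are linked URIs themselves.
        canonical.append(
            (language_links[subject], predicate, language_links.get(obj, obj))
        )
    return canonical


# Hypothetical sample data.
links = {"http://de.dbpedia.org/resource/Rom": "http://dbpedia.org/resource/Rome"}
sample = [
    ("http://de.dbpedia.org/resource/Rom", "rdfs:label", "Rom"),
    ("http://de.dbpedia.org/resource/Unlinked", "rdfs:label", "x"),
]
result = canonicalize(sample, links)
```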
DBpedia 2014 : Facts and Numbers 
    Instances, localized (non-en) URIs    Mapping-based statements
Lang        3.9       2014  diff, %              3.9        2014  diff, %
en    4,258,406  4,584,616      7.7       41,804,545  61,255,734     46.5
nl    1,461,314  1,774,536     21.4        5,039,583   6,752,260     34.0
de    1,547,785  1,692,634      9.4        4,070,927   6,733,886     65.4
fr    1,378,099  1,504,453      9.2        5,273,302   6,899,052     30.8
it    1,029,528  1,128,909      9.7        5,724,415   7,984,501     39.5
ru      999,165  1,119,142     12.0        3,174,725   4,070,294     28.2
es    1,003,158  1,086,296      8.3        5,950,626   7,070,608     18.8
pl      960,880  1,043,400      8.6        4,624,126   6,031,811     30.4
ja      860,917    913,488      6.1        1,674,891   2,136,719     27.6
pt      764,132    812,610      6.3        4,489,235   5,098,947     13.6
* More at http://wiki.dbpedia.org/Datasets2014/DatasetStatistics 
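The "diff, %" columns are consistent with relative growth from 3.9 to 2014, i.e. (new - old) / old × 100. A small sketch that recomputes a few values from the slides; the function name is an assumption, and the rounding precision is a parameter because the per-language table uses one decimal while the English class table uses two.

```python
def percent_diff(old, new, digits=1):
    """Relative change from `old` to `new`, in percent, rounded to `digits`."""
    return round((new - old) / old * 100, digits)


# English instances and mapping-based statements, 3.9 vs. 2014 (from the table).
en_instances = percent_diff(4_258_406, 4_584_616)
en_statements = percent_diff(41_804_545, 61_255_734)

# Persons in English DBpedia, rounded to two decimals as on the class slide.
persons = percent_diff(832_000, 1_445_000, digits=2)
```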
DBpedia 2014 : Facts and Numbers 
English DBpedia
Class                           3.9       2014  diff, %
Persons                     832,000  1,445,000    73.68
Places                      639,000    735,000    15.02
Populated Places            427,000    478,000    11.94
Creative Works              372,000    411,000    10.48
Music Albums                116,000    123,000     6.03
Films                        78,000     87,000    11.54
Video Games                  18,500     19,000     2.70
Organizations               209,000    241,000    15.31
Companies                    49,000     58,000    18.37
Educational Institutions     45,000     49,000     8.89
Species                     226,000    251,000    11.06
Diseases                      5,600      6,000     7.14
DBpedia 2014 : Organizational 
• International DBpedia chapters
  • http://wiki.dbpedia.org/Internationalization/Chapters
  • …how many are alive?
  • Call to chapter maintainers: please update the data!
  • …or, do you prefer to (more frequently) extract data on your own?
• Let’s keep track of all the ongoing and completed DBpedia projects
  • https://github.com/dbpedia/extraction-framework/wiki
DBpedia 2014 : Open Points 
• Mappings are… aging
  • cs (Czech) template usage patterns changed
  • Fixed: redirects in both directions are now resolved
  • Fix on the statistics page?
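One way to read "redirects in both directions are now resolved": a mapping should be found whether the article uses a template name that redirects to the mapped template, or the mapping was written for a name that now redirects to the template in use. The sketch below illustrates that idea only; it is not the Extraction Framework's implementation, handles single-hop redirects only, and all names and sample data are hypothetical.

```python
def find_mapping(template, mappings, redirects):
    """Look up a mapping for `template`, following redirects in both directions.

    `mappings` maps template names to mapping objects; `redirects` maps a
    redirecting template name to its target name.
    """
    if template in mappings:
        return mappings[template]
    # Forward: the template in use redirects to a mapped template.
    target = redirects.get(template)
    if target in mappings:
        return mappings[target]
    # Backward: a mapped template name redirects to the template in use.
    for source, destination in redirects.items():
        if destination == template and source in mappings:
            return mappings[source]
    return None


# Hypothetical sample data.
mappings = {"Infobox new": "mapping-for-new", "Infobox legacy": "mapping-for-legacy"}
redirects = {"Infobox old": "Infobox new", "Infobox legacy": "Infobox current"}
```

With this data, `find_mapping("Infobox old", mappings, redirects)` follows the redirect forward, while `find_mapping("Infobox current", mappings, redirects)` finds the mapping written for the older name.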
DBpedia 2014 : Open Points 
• Mappings are… aging
  • For some old and new templates parallel mappings exist!
    • New, not detailed, created by Lebot (who is to blame?): http://mappings.dbpedia.org/index.php/Mapping_en:Infobox_spaceflight
    • Old, detailed, created in 2010: http://mappings.dbpedia.org/index.php/Mapping_en:Infobox_space_mission
  • Redirects from the old to the new one exist in Wikipedia
  • …but our fix would not help in this case
DBpedia 2014 : Open Points 
• Wikidata vs. DBpedia
  • Wikidata properties adoption in Infoboxes, counts
    • English: 5,229 occurrences
    • Italian: 291
    • German: 460
    => strategy – ignore (so far)
  • Does any kind of template auto-completion exist?
DBpedia 2014 : Open Points 
• Nesting templates and conditional logic
  • Infobox is filled from other templates, Extraction Framework gives very few results
  • Strategy – ???
DBpedia 2014 : More Points 
• Take care of documentation and comments
  • A big effort to improve extraction/release preparation guides while working on the release, to be integrated: https://github.com/dfleischhacker/extraction-framework/wiki/
• Testing and data quality checking should be done on languages other than English
• Next time announce not only a mapping sprint, but also a coding sprint
• …volunteers for doing the next release?


Editor's Notes

  • #6: Wikidata dumps from June 2014. 116 specialized datatype properties (DBpedia 3.9: 116).
  • #8: Support for RDF 1.1 – new data types? Possibility to in mappings. 2 last points – from the commits.
  • #9: Top 10 languages in terms of number of instances in the 2014 version. * More at http://wiki.dbpedia.org/Datasets2014/DatasetStatistics
  • #13: http://en.wikipedia.org/wiki/STS-133 See DBpediaDBpedia2014problems-recheckdistinct-props for more examples
  • #15: It turned out that templates such as [Template:Automatic_taxobox] make clever use of the [nesting feature] (that is, calling other templates) plus some [conditional logic]. The taxonomic information about species in Wikipedia is encoded in a network of template pages of the form “Template:Taxonomy/taxon_name” (e.g. [Template:Taxonomy/Pterosauria]). Each taxon template page specifies the taxon name (e.g. “Pterosauria”), the taxon rank (e.g. “order”) and the parent taxon (e.g. “Pterosauromorpha”). Templates such as [Template:Automatic_taxobox] access this template network to derive the whole classification of a species (or a higher-level taxon) given only its name.
  • #16: More bugs to fix: e.g. check http://live.dbpedia.org/page/C++
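Note #15 describes how an automatic taxobox derives a full classification by following parent links through the "Template:Taxonomy/…" pages. A minimal sketch of that traversal, assuming the template network has already been loaded into a dictionary. Only the Pterosauria entry (rank "order", parent "Pterosauromorpha") comes from the note; the dictionary layout, the function name and the placeholder rank for the parent are assumptions.

```python
# taxon name -> (rank, parent taxon or None)
TAXONOMY = {
    "Pterosauria": ("order", "Pterosauromorpha"),       # values from note #15
    "Pterosauromorpha": ("placeholder rank", None),     # hypothetical, chain cut here
}


def classification(taxon, taxonomy):
    """Follow parent links upward and return [(rank, name), ...] from `taxon` up."""
    chain = []
    seen = set()  # guards against accidental cycles in the data
    while taxon is not None and taxon in taxonomy and taxon not in seen:
        seen.add(taxon)
        rank, parent = taxonomy[taxon]
        chain.append((rank, taxon))
        taxon = parent
    return chain


lineage = classification("Pterosauria", TAXONOMY)
```

An extractor that only reads the infobox call itself sees just the taxon name; reproducing the classification requires this kind of walk over the referenced template pages.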