Transforming data silos into knowledge:
Early Chinese Periodicals Online (ECPO)
Matthias Arnold, Lena Hessel | Heidelberg | E-Science-Tage 2019 | 2019-03-29
Research data: Chinese periodical press
- First decades of the 20th century
- Understudied, but dominated the contemporary print market and
  provides access to the "actual culture" (R. Williams, 1961)
- Challenges:
  - Physically dispersed, often poorly preserved
  - Voluminous (full runs, daily, up to >30 years)
  - Multi-generic and intellectually demanding
- Approach
  - Multi-disciplinary team, >10 researchers
  - Women and the Periodical Press in China's Global
    Twentieth Century: A Space of Their Own? Ed. by Joan
    Judge, Barbara Mittler and Michel Hockx, Cambridge
    University Press, 2018.
- Database
Early Chinese Periodicals Online (ECPO)
https://uni-heidelberg.de/ecpo
276 publications: 134 with items
>279,000 scans
40,936 issues: 46,931 articles, 20,532 images, 18,639 ads
Chart: Publication activity by year
Arnold and Hessel | ECPO Database
Opening the data silo
From static export to dynamic data service
- Output data using the Metadata Object Description Schema
  (MODS) - Open Access: http://ecpo.uni-hd.de/api/mods/
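As an illustration of what a client could do with MODS output, the following is a minimal sketch that parses one MODS record and pulls out a few fields. The sample record is invented for this example, and the choice of elements (titleInfo, name, originInfo) is an assumption; the records actually served by the ECPO API may be structured differently.

```python
import xml.etree.ElementTree as ET

# Namespace of MODS version 3 records.
MODS_NS = {"mods": "http://www.loc.gov/mods/v3"}

# Hypothetical sample record, NOT taken from ECPO.
SAMPLE_MODS = """<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo><title>Example article title</title></titleInfo>
  <name type="personal"><namePart>Example Author</namePart></name>
  <originInfo><dateIssued>1919-04-03</dateIssued></originInfo>
</mods>"""


def summarize_mods(xml_text: str) -> dict:
    """Return title, names and issue date of a single MODS record."""
    root = ET.fromstring(xml_text)

    def first(path):
        # Text of the first matching element, or None if it is absent.
        node = root.find(path, MODS_NS)
        return node.text if node is not None else None

    return {
        "title": first("mods:titleInfo/mods:title"),
        "names": [n.text for n in root.findall("mods:name/mods:namePart", MODS_NS)],
        "date_issued": first("mods:originInfo/mods:dateIssued"),
    }


print(summarize_mods(SAMPLE_MODS))
```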
From static pre-rendered files to dynamic image service
- Implementation of International Image Interoperability
  Framework (IIIF) Image API http://iiif.io/technical-details/
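A dynamic image service in the IIIF sense answers URLs of the form {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below only builds such a URL; the base URL and the identifier are placeholders, not real ECPO endpoints, and the size keyword "max" assumes Image API 2.1 or later.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Build an IIIF Image API request URL from its path segments."""
    # The identifier is percent-encoded so that slashes inside it
    # do not get mistaken for path separators.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and page identifier, for illustration only.
print(iiif_image_url("https://example.org/iiif", "jingbao/1919-04-03_p1",
                     region="100,200,300,400"))
```

The same scan can thus be requested whole or as a cropped region without pre-rendering any derivative files.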
From separate names to cross-db agents service
- Identify agent, assign names, link to authorities, structure
  information, feed data back to authority files (GND)
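One way to picture an agent record with several names and links to authority files is sketched below. The class, its field names and the identifiers are hypothetical and do not reflect the actual ECPO data model.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Agent:
    """One person or organisation, independent of how a name was printed."""
    agent_id: str
    names: dict = field(default_factory=dict)        # language tag -> name form
    authorities: dict = field(default_factory=dict)  # authority name -> identifier


def count_authority_links(agents):
    """Count how many agents carry a reference to each authority file."""
    return Counter(auth for agent in agents for auth in agent.authorities)


# Placeholder data for illustration; the identifiers are invented.
agents = [
    Agent("a1", {"en": "Example Person"}, {"VIAF": "0000", "GND": "0000-0"}),
    Agent("a2", {"en": "Another Person"}, {"VIAF": "1111"}),
    Agent("a3", {"en": "Unlinked Person"}),
]
print(count_authority_links(agents))
```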
Agents Service
47,245 agents, 163,408 occurrences, 15 languages
Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): From Digitization to Open Data | JADH 2018
Arnold, Heidelberg | Early Chinese Periodicals Online (ECPO): Transforming data silos into knowledge | E-Science 2019
VIAF
GND
Wikidata
Baidu baike
Agents with references to authorities:
VIAF: 861
Wikidata: 821
GND: 662
Baidu: 6
DBpedia: 5
Opening the Agents Service
Islington Corinthians F.C.:
- Leonard Bradbury
- Jack Braithwaite
- Alec Buchanan
- Pat Clark
- George Dance
- Cyril Longman
- Harry Lowe
- Richard Manning
- Albert (Eddie) Martin
- John Miller
- William Miller
- George Pearce
- Bert Read
- Johnny Sherwood
- Dick Tarrant
- Bill Whittaker
- Ted Wingfield
- J.K. Wright
Source: National Library Board Singapore, NewspaperSG,
accessed March 25, 2019,
http://eresources.nlb.gov.sg/newspapers/Digitised/Article/straitstimes19371128-1.2.117.
Towards full text
Expanding data: towards full text
- Manual typing not feasible
- Professional double-keying very expensive
- OCR often unusable
  - Document: dense layout, normal segmentation fails
  - Image: noisy, secondary copies with stains/scratches
  - Characters: special characters (emphasis), handwriting
ca. 63% correctly recognized
Segmentation - I
- Page segmentation (pattern recognition/computer vision)
- Analyze layout of page, use page-internal structures
- Identify semantic units
- Generate co-ordinates, relate them to items, store in DB
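The last step, relating generated co-ordinates to items, could be modelled roughly as follows. This is a sketch under assumptions: the Segment fields, the pixel units and the grouping by item identifier are invented for illustration and are not the project's database schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A rectangular region on a page scan, in pixel co-ordinates."""
    page_id: str
    item_id: str  # the article, image or ad this region belongs to
    x: int
    y: int
    w: int
    h: int


def group_by_item(segments):
    """Collect all regions that belong to the same item."""
    groups = defaultdict(list)
    for seg in segments:
        groups[seg.item_id].append(seg)
    return dict(groups)


def bounding_box(segments):
    """Smallest (x, y, w, h) rectangle enclosing all given regions."""
    x0 = min(s.x for s in segments)
    y0 = min(s.y for s in segments)
    x1 = max(s.x + s.w for s in segments)
    y1 = max(s.y + s.h for s in segments)
    return (x0, y0, x1 - x0, y1 - y0)


# Hypothetical regions on one page.
segs = [
    Segment("p1", "article-1", 10, 10, 100, 50),
    Segment("p1", "article-1", 10, 70, 100, 30),
    Segment("p1", "ad-1", 200, 10, 80, 80),
]
for item_id, parts in group_by_item(segs).items():
    print(item_id, bounding_box(parts))
```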
Segmentation - II
- Page segmentation (crowdsourcing)
- Pilot project with Pallas Ludens GmbH
- Let the crowd help analyze the pages
- Identify and label four item types:
  - image/drawing
  - article
  - advertisement
  - additional information
- Supervised
- Non-Chinese speaking community!
Processing
2. Page segmentation (computer vision/OCR)
Grouping semantic units
2. Page segmentation (crowdsourcing)
- drawing, correcting, grouping
Outcome of segmentation pilot
1. Page segmentation can be outsourced to expert crowd
   - Requires supervision
   - Advanced user interfaces (high usability, efficiency)
   - Crowd should read Chinese (semantic grouping)
2. Jingbao 1919-21 completely segmented with qualified
   boxes, issues of April 1919 with semantic units
3. Further processing:
   - Partnership with Computational Knowledge Lab,
     Department of Engineering Science and
     Ocean Engineering, National Taiwan University,
     http://www.cklab.org/
   - Seeking additional partners for collaboration!
Chinese Republican Periodicals:
Encoding full text in TEI
Materiality issues
Mark-up: Different character sizes
<tagsDecl>
  <rendition scheme="css" selector="body p">font-size: 100%;</rendition>
  <rendition xml:id="half">font-size: 50%</rendition>
  <rendition xml:id="double">font-size: 200%</rendition>
</tagsDecl>
<hi rendition="#double">…</hi>
<hi rendition="#half">…<lb/>…</hi>
Mark-up: Emphasis
In Japanese: emphasis dots 圏点 (kenten) or 傍点 (bōten)
1. ◦ U+25E6 open dot
2. • U+2022 filled dot
3. ○ U+25CB open circle
4. ● U+25CF filled circle
5. ◎ U+25CE open double-circle
6. ◉ U+25C9 filled double-circle
7. △ U+25B3 open triangle
8. ▲ U+25B2 filled triangle
9. ﹆ U+FE46 open sesame
10. ﹅ U+FE45 filled sesame
https://drafts.csswg.org/css-text-decor-3/#text-emphasis-style-property
BUT: emphasis characters are mixed with punctuation;
differentiating and recording them exactly is a
HUGE workload
-> emphasis characters currently ignored
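Ignoring emphasis characters can be pictured as a simple filter over the ten code points listed above. The sketch below is an assumption about how such a filter might look, not the project's actual processing. It also shows the caveat from the slide: a character such as U+2022 may be real punctuation, and a blanket filter removes it either way.

```python
# The ten emphasis marks from the CSS text-emphasis-style list.
EMPHASIS_MARKS = {
    "\u25E6": "open dot",
    "\u2022": "filled dot",
    "\u25CB": "open circle",
    "\u25CF": "filled circle",
    "\u25CE": "open double-circle",
    "\u25C9": "filled double-circle",
    "\u25B3": "open triangle",
    "\u25B2": "filled triangle",
    "\uFE46": "open sesame",
    "\uFE45": "filled sesame",
}

# Translation table that deletes every emphasis mark.
_DELETE_MARKS = str.maketrans("", "", "".join(EMPHASIS_MARKS))


def strip_emphasis(text: str) -> str:
    """Drop all emphasis marks; cannot tell emphasis from punctuation."""
    return text.translate(_DELETE_MARKS)


print(strip_emphasis("A\u25CFB\uFE45C"))
```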
Mark-up: Spaces between some characters
<space unit="chars" n="1"/>
OR
<gap unit="char" extent="1">　</gap>
(with "　" being U+3000)
OR
just use U+3000 without markup
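If the plain U+3000 option is used during transcription, markup can still be derived later. The following sketch converts runs of ideographic spaces into TEI space elements. Note that it writes the count into the quantity attribute, which is how TEI P5 expresses dimensions, rather than the n attribute shown on the slide; the function name is hypothetical.

```python
import re
from xml.sax.saxutils import escape

IDEOGRAPHIC_SPACE = "\u3000"


def encode_spaces(text: str) -> str:
    """Replace each run of U+3000 with a TEI <space> element."""
    parts = []
    # The capturing group keeps the runs of spaces in the split result.
    for chunk in re.split(f"({IDEOGRAPHIC_SPACE}+)", text):
        if not chunk:
            continue
        if chunk[0] == IDEOGRAPHIC_SPACE:
            parts.append(f'<space unit="chars" quantity="{len(chunk)}"/>')
        else:
            # Escape &, < and > so the result stays well-formed XML.
            parts.append(escape(chunk))
    return "".join(parts)


print(encode_spaces("A\u3000\u3000B"))
```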
TEI Example
Wrap-up
From data silo towards open data
- Data collection = research data
- Enhance metadata
  - Publishing information, content analysis (keywords)
- Separation of meta-/data from user interface
- FAIR principles
  - DOI records for publications (in progress), connect database
    to library catalogs
  - Publish material and metadata Open Access: images,
    publication metadata, and item metadata (article, image, ad)
  - Basic data API (MODS XML);
    open up IIIF manifests and Agents data (planned)
  - Publish metadata on heiDATA/Dataverse (Summer)
Wrap-up
- Provide different ways to access data via frontend:
  - Search (all metadata and annotations)
  - Browse chronological (calendar)
  - Browse/search agents / keywords
  - Categories of publications
- Agents service (biographic data)
  - cross-db record curation, connect persons with authorities
  - plan (2019): add missing agents or names to GND, pull additional
    data from authorities, develop agents API
- Page segmentation: crowdsourcing possible, grouping
  requires Chinese, new tool creates web-annotations; seeking
  partner for automatic page analysis
- Text (plan): process segments, generate full text, store TEI
  XML, crowd-based editing
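For the web-annotations mentioned under page segmentation, a labelled page region can be expressed in the W3C Web Annotation data model with a media-fragment selector. The sketch below is an assumption about what such an annotation might contain; the URIs are placeholders and the output of the project's own tool may differ.

```python
import json

# The four item types named in the segmentation pilot.
ITEM_TYPES = {"image/drawing", "article", "advertisement", "additional information"}


def region_annotation(anno_id, page_uri, x, y, w, h, item_type):
    """Describe one labelled rectangle on a page as a Web Annotation."""
    if item_type not in ITEM_TYPES:
        raise ValueError(f"unknown item type: {item_type}")
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": anno_id,
        "type": "Annotation",
        "motivation": "tagging",
        "body": {"type": "TextualBody", "value": item_type, "purpose": "tagging"},
        "target": {
            "source": page_uri,
            "selector": {
                "type": "FragmentSelector",
                "conformsTo": "http://www.w3.org/TR/media-frags/",
                "value": f"xywh={x},{y},{w},{h}",
            },
        },
    }


# Hypothetical identifiers, for illustration only.
anno = region_annotation("https://example.org/anno/1",
                         "https://example.org/pages/p1",
                         10, 10, 100, 90, "article")
print(json.dumps(anno, indent=2))
```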
ECPO in a larger context
- Content expansion
  - Early western publications printed in China
  - Co-operation with Univ. Erlangen: Agents
- ECPO as data platform
  - for storing, enhancing, accessing, sharing "grey"
    material from the CATS Library
- Outreach/Communities
  - DHd working group Newspaper/Journals, OCR-D,
    Transkribus/READ
  - Connect with FID Asien (CrossAsia), Non-Latin scripts
    interest group, TEI East Asia SIG
- Long-term repository: University Library, heiDATA/heidICON
Contact
Matthias Arnold, Lena Hessel
Heidelberg Centre for Transcultural Studies | HCTS
Karl Jaspers Centre
Voßstr. 2 | Building 4400 | Room 005b
69115 Heidelberg, Germany
Phone: +49 - 6221 - 54 4094
eMail: matthias.arnold@uni-hd.de
Web: http://tinyurl.com/matthias-arnold
