NLM CONVERSION TO BUILD
ATOMIC PHYSICS CONTENT IN AN
AGILE FASHION
JATS-CON, April 2, 2014
OSA - The Optical Society &
DCL - Data Conversion Laboratory, Inc.
1
scholarly publisher with 19 current and legacy
journals, 300+ conference proceedings
2
How?
Break 1917-2012 content into well-polished
atomic pieces following an industry standard
Develop infrastructure to manage and enrich
content, to build new products and services in an
agile fashion
Budget allocated for five-year strategic plan
OSA Governance: Build more-flexible products and services!
3
Some evidence of success
With content converted to NLM XML, OSA has developed:
Enhanced article: Interactive HTML
Derivative products: ImageBank
Business Intelligence: New insights into author,
topic, funding, and other trends
4
Citation data
5
Equation data
6
7
Legacy content (750,000 journal pages)
We expected this . . .
8
This . . . not so much
JOURNAL AS
COMIC BOOK
SCHOOL YEARBOOK
9
1. Most confusing: Articles skipping
pages, sometimes in two directions
10
2. Most shocking: legacy PDF not matching legacy print
[Figure: print page and legacy PDF for the same article]
11
3. Most pervasive: nonscientific content tacked onto research articles
These are not the authors
12
Project specifications: two extremes
1. Hand the project over to the trusted vendor and be done with it
2. Spend up to a year doing heavy content analysis and spec creation
13
Data Conversion Laboratory
 We convert content from any format to any format.
 Expertise with JATS, and most industry standard DTDs and Schemas
 Established in 1981; a pioneer in the data conversion industry
 Over a billion pages converted
 Expertise in complex conversion projects: STM Publishing, eBooks,
Technical documents, Educational Publishing, and Library Digitization
 Projects range from one book to entire libraries and legacy collections
 Infrastructure for large-scale projects, with automated tracking, quality
assurance, and customer reporting for every item
 Industries include Publishing, Technical Societies, Aerospace,
Government, Defense, Health Sciences, Libraries & Universities
 Publish DCLNews, a monthly newsletter devoted to XML and
Electronic Publishing topics going to 7,000 subscribers
14
Thoughts on Managing a Large Legacy
Conversion Effort
1) Phased Approach
2) Flexibility and Collaboration
3) Keep it Simple
4) Keep Monitoring Quality
15
1) Phased Approach
Why?
 Varied sources (PDF, XML, SGML)
 Content that changed over time
 Very large input corpus going back to 1917
 Allow for the quick, phased release of new OSA products
Strategy for OSA materials
 Focus on one source type at a time but keep the big picture in mind
 Convert newest material first
 Review and decide on conversion nuances as they came up
16
Source Material Challenges
XML
 OSA Proprietary DTD
 NLM v2.3 DTD
PDF
 PDF Normal
 PDF Image
SGML
 Multiple DTDs
17
2) Build Flexibility and Collaboration into the Conversion Process
 Develop an overall specification, with allowance for change as new scenarios are uncovered
 Software development sprints to incorporate changes
 Close collaboration with OSA to manage new situations affecting completed work and work in process
18
Tools Used to Retain Flexibility
 Client-Vendor collaboration for decision making
 Hub and Spoke processing
 Handling of conversion anomalies
 Quality assurance reviews
 Learning databanks
19
3) There's a Lot of Detail: Keep It Simple
 Fitting structures into the existing JATS tagging structure
 CALS to HTML table conversion
 MathML line break retention
 Cross-reference ranges
 Rendering limitations
 Unexpected content scenarios
20
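The CALS-to-HTML item on this slide comes down to translating CALS span attributes into HTML ones so the converted table is not skewed. The following is a minimal Python sketch of that idea, not DCL's conversion software: the function name `cals_to_html` is hypothetical, it assumes a single `<tgroup>` whose `<colspec>` elements name every column, and it does not handle cells that CALS leaves implicit.

```python
import xml.etree.ElementTree as ET


def cals_to_html(cals_xml):
    """Convert one CALS <tgroup> into an HTML <table> string (illustrative only)."""
    tgroup = ET.fromstring(cals_xml)

    # Map each CALS column name to its 1-based position.
    col_index = {}
    for pos, spec in enumerate(tgroup.findall("colspec"), start=1):
        col_index[spec.get("colname", f"c{pos}")] = pos

    table = ET.Element("table")
    for section, cell_tag in (("thead", "th"), ("tbody", "td")):
        src = tgroup.find(section)
        if src is None:
            continue
        dst = ET.SubElement(table, section)
        for row in src.findall("row"):
            tr = ET.SubElement(dst, "tr")
            for entry in row.findall("entry"):
                cell = ET.SubElement(tr, cell_tag)
                cell.text = "".join(entry.itertext())
                # Horizontal span: namest/nameend become colspan.
                start, end = entry.get("namest"), entry.get("nameend")
                if start and end:
                    span = col_index[end] - col_index[start] + 1
                    if span > 1:
                        cell.set("colspan", str(span))
                # Vertical span: morerows=N becomes rowspan=N+1.
                more_rows = int(entry.get("morerows", "0"))
                if more_rows:
                    cell.set("rowspan", str(more_rows + 1))
    return ET.tostring(table, encoding="unicode", method="html")
```

A production converter would also need to insert the empty cells that CALS allows an author to omit, which is the case the notes single out as a cause of skewed tables.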
Cross-Reference Ranges
 Bibliographic
 Figure
21
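The range policy described in the editor's note on cross-reference ranges (citations keep only the lower and upper limits; figures, tables and equations get a tag for every item in the range) can be sketched in Python. This is an illustration under stated assumptions, not the project's code: the function name `expand_xref_range`, the id prefixes, and the string output are all hypothetical.

```python
def expand_xref_range(ref_type, prefix, first, last):
    """Return <xref> strings for a callout range such as 'Figs. 2-5' or '[3-7]'.

    Policy taken from the slide notes:
      - bibliographic ranges ("bibr"): tag only the lower and upper limits
      - figure, table and equation ranges: tag every item in the range
    """
    if first > last:
        raise ValueError("range start must not exceed range end")
    if ref_type == "bibr":
        numbers = [first] if first == last else [first, last]
    else:
        numbers = list(range(first, last + 1))
    return [
        f'<xref ref-type="{ref_type}" rid="{prefix}{n}">{n}</xref>'
        for n in numbers
    ]
```

For example, `expand_xref_range("bibr", "b", 3, 7)` yields two tags (for b3 and b7), while `expand_xref_range("fig", "f", 2, 5)` yields four.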
Rendering Limitations
 No CSS support for table character alignment
[Figure: the same table rendered in PDF and in HTML]
22
Unexpected Content Scenarios
 Missing text - Printed page problems
23
Unexpected Content Scenarios (cont.)
 Jumping pages
24
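The jumping-pages requirement (record all necessary page numbers in `<fpage>`, `<lpage>` and `<page-range>`) can be sketched as follows. This is a minimal Python sketch with assumptions labeled: the function name `page_metadata` is hypothetical, pages are assumed to arrive in reading order, and taking the first and last pages in reading order as `fpage` and `lpage` is an assumption, since the slides do not state the rule for articles that jump backward.

```python
def page_metadata(pages):
    """Build fpage, lpage and page-range values from pages listed in reading order."""
    if not pages:
        raise ValueError("an article needs at least one page")

    # Group consecutive pages into runs, e.g. [792, 793, 801] -> (792, 793), (801, 801).
    runs = []
    start = prev = pages[0]
    for page in pages[1:]:
        if page == prev + 1:
            prev = page
            continue
        runs.append((start, prev))
        start = prev = page
    runs.append((start, prev))

    page_range = ", ".join(
        str(a) if a == b else f"{a}-{b}" for a, b in runs
    )
    return {
        "fpage": str(pages[0]),
        "lpage": str(pages[-1]),
        "page-range": page_range,
    }
```

An article printed on pages 792-793 and continued on page 801 would get `fpage` 792, `lpage` 801 and a `page-range` of "792-793, 801".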
Unexpected Content Scenarios (cont.)
 Special characters with no corresponding Unicode
25
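The fallback described for this slide (a chemical double bond mapped to U+2550 from the box-drawing block) amounts to a small lookup table that editors extend as new characters turn up. A minimal Python sketch, assuming a hypothetical source name `double-bond` and hypothetical function names; only the U+2550 mapping itself comes from the notes.

```python
import unicodedata

# Source character name -> stand-in Unicode character.
# "double-bond" is a hypothetical key; U+2550 is the fallback named in the notes.
FALLBACK_CHARACTERS = {
    "double-bond": "\u2550",
}


def fallback_character(name):
    """Return the stand-in character, or fail loudly so the case gets reviewed."""
    try:
        return FALLBACK_CHARACTERS[name]
    except KeyError:
        raise LookupError(f"no fallback mapped for {name!r}; needs editorial review") from None


def describe(name):
    """Report the code point and official Unicode name of the stand-in."""
    char = fallback_character(name)
    return f"U+{ord(char):04X} {unicodedata.name(char)}"
```

Raising on an unmapped name rather than guessing keeps each new special character a deliberate editorial decision.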
Unexpected Content Scenarios (cont.)
 Non-standard Structure
<body>
  <boxed-text>
    <sec>
      <title>Optical Activities in Industry</title>
      <p>66 Summer Street, North Brookfield, Mass. Mr. Cooke welcomes news and comments for this column which should be sent to him at the above address</p>
      <p><inline-graphic xlink:href="ao-8-4-792-i001"/></p>
    </sec>
  </boxed-text>
26
Unexpected Content Scenarios (cont.)
 White space filler
27
4) Keep Checking Quality
 Don't Get Too Far Ahead
 Visual review
 OSA Schematron
 Reporting stylesheets
 OCR and hyphenation spellchecker software
 QA software
 Learning databanks
Visual Review
 Correct entities are used
 Math displays correctly
 Table alignment is accurate
 Images correspond to the source
29
OSA Schematron
 The Schematron includes over 300 checks
Warning:ALERT [LJF:RGCO250]: ref 'b10': unpublished materials must have @publication-type='other' ($unpublished and @publication-type != 'communication' and @publication-type != 'other' / warning) [report]
Warning:ALERT [LJF:JBCO140]: no tables found but title reads 'Figures and Tables' (matches(title, 'Table') and not(exists(table-wrap)) / warning) [report]
ERROR [LJF:RGCO250]: ref 'b14': journal citation contains more than one article-title (count(article-title) > 1) [report]
30
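The three messages on this slide correspond to simple structural rules. The sketch below re-expresses them in Python with the standard library so the logic is easy to follow; it is not a port of the OSA Schematron. The function name `check_article` is hypothetical, and it assumes citations are tagged as `<mixed-citation>` with a `publication-type` attribute and that unpublished work is signalled by the text "to be published".

```python
import xml.etree.ElementTree as ET


def check_article(xml_text):
    """Return (level, message) tuples for three illustrative checks."""
    root = ET.fromstring(xml_text)
    messages = []

    for ref in root.iter("ref"):
        ref_id = ref.get("id", "?")
        for citation in ref.iter("mixed-citation"):
            pub_type = citation.get("publication-type")
            text = "".join(citation.itertext()).lower()
            # Unpublished material should be typed as 'other' (or 'communication').
            if "to be published" in text and pub_type not in ("communication", "other"):
                messages.append((
                    "warning",
                    f"ref '{ref_id}': unpublished materials must have "
                    "@publication-type='other'",
                ))
            # A journal citation may carry only one article title.
            if pub_type == "journal" and len(citation.findall("article-title")) > 1:
                messages.append((
                    "error",
                    f"ref '{ref_id}': journal citation contains more than one article-title",
                ))

    # A section titled with 'Table' should actually contain a table.
    for sec in root.iter("sec"):
        title = sec.findtext("title") or ""
        if "Table" in title and sec.find(".//table-wrap") is None:
            messages.append(("warning", f"no tables found but title reads '{title}'"))

    return messages
```

Keeping each rule as a short, separately reported test mirrors how the slide's messages carry a rule id, the offending item, and a severity.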
DCL QA Software
 Highlight any discrepancies between the specifications and the tagging
 Identify suspicious start of a paragraph
 Flag missing external files associated with the XML
 Find missing cross references to specified structures such as Tables and Figures
31
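One of the checks listed on this slide, finding figures and tables that are never called out in the text, can be illustrated with a few lines of Python. This is a sketch under assumptions, not DCL's QA software: the function name `unreferenced_targets` is hypothetical, and it assumes JATS-style `<xref rid="...">` callouts pointing at `id` attributes on `<fig>` and `<table-wrap>`.

```python
import xml.etree.ElementTree as ET


def unreferenced_targets(xml_text, target_tags=("fig", "table-wrap")):
    """List (tag, id) pairs for structures that no <xref> points to."""
    root = ET.fromstring(xml_text)

    # Collect every id mentioned by an xref; rid may hold several ids.
    referenced = set()
    for xref in root.iter("xref"):
        referenced.update((xref.get("rid") or "").split())

    missing = []
    for tag in target_tags:
        for element in root.iter(tag):
            element_id = element.get("id")
            if element_id and element_id not in referenced:
                missing.append((tag, element_id))
    return missing
```

The output is a list for a human to review rather than an automatic fix, since a missing callout can be a conversion error or a fault in the source itself.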
Hyphenation
Spellchecker
32
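The editor's note for this slide says the software finds different variations of the same word to help spot extra hyphenation. A minimal Python sketch of that report follows; the function name `hyphenation_variants` and the rule of comparing words with their hyphens removed are assumptions, not a description of the actual tool.

```python
import re
from collections import defaultdict


def hyphenation_variants(text):
    """Group words that differ only by hyphenation, e.g. 'wave-length' vs 'wavelength'."""
    groups = defaultdict(set)
    for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text):
        key = word.replace("-", "").lower()
        groups[key].add(word.lower())
    # Report only words that occur in more than one form.
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}
```

A word that appears both with and without a hyphen is a likely leftover from an end-of-line break in the scanned page, so each reported group is a candidate for proofreading.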
Reporting Stylesheets
 Provides easier review of metadata components for a set of articles
33
OCR Tools
 Modified versions of the fonts, designed to help distinguish between similar-looking characters (O vs 0, Z vs 2, 1 vs l), are used within the proofreading phase
34
Learning Databanks
Ongoing updates made based on feedback and newly determined rules and structures
 Conversion software
 QA software
 Schematron
 Spellchecker and hyphenation software
 Editorial guidelines
 Image creation
35
Conclusions
OSA has nearly completed a large backfile conversion project in close
coordination with DCL. The project, which is based around NLM markup, has
allowed OSA to enhance its publishing platform, build derivative products, and
significantly improve its ability to gather business intelligence from a deep
journal backfile. We offer the following lessons learned:
 With large content projects, plan ahead but prepare to work in an
agile fashion
 The content owner should stay engaged throughout the project to
align real-time decisions with business aims
 Owner-vendor collaboration, when the right partners are involved, improves morale, attention to detail, and decision-making
36
Scott Dineen
Sr. Director Publishing Production & Technol.
The Optical Society
sdinee@osa.org
Devorah Ashlem
Senior Project Manager
Data Conversion Laboratory
dashlem@dclab.com
37


Editor's Notes

  • #3: OSA is a scholarly publisher with 18 current and legacy journals, 300+ conference proceedings, a member magazine, and a growing number of additional products. Legacy journals go back to 1917 and have heavy physics content. The content also holds many surprises, e.g., foreign language articles, society news, yearbook-like pictures, cartoons, and doodles.
  • #6: Citation data modeled in various ways to help researcher size up the article quickly.
  • #7: All equations are captured as MathML; the display solution is MathJax running on the OSA server. It works on all major browsers (including on smartphones) with no user download. MathJax performance will improve with better native MathML support from browsers.
  • #8: Go to Optics ImageBank to browse all figure images across all content!
  • #10: Legacy journals served as the membership magazine but also at times like a yearbook. Particularly challenging were hundreds of photos of people, unrelated to the article, who were tacked onto the ends of articles as filler.
  • #14: In developing specs, we had to achieve balance between two extremes. The method we settled on was highly collaborative and required both parties to be agile. The balance of the presentation will describe the process that OSA and DCL used to work collaboratively to convert 750,000 pages of legacy journal content to well-polished NLM XML.
  • #17: Phasing by source material, journal type, and newest-to-oldest order enabled conversion of the full collection in the shortest amount of time.
  • #18: Source types we dealt with in the different phases of this project. The OSA DTD XML is created as part of the production process and is used for publishing, not done as an afterthought. It's XML first. NLM 2.3 and SGML material was XML/SGML last.
  • #19: Because of the wide range of inputs and many years of varied material, we wanted to build flexibility into the process, and these are three ways to do it. Weekly meetings with OSA were scheduled with an agenda, in addition to close communication any time something new came up. OSA had to make business decisions about the repercussions of going back to make changes.
  • #21: No superset of the JATS DTD was created. This is a list of some of the conversion details that had to be dealt with. Fitting structures into NLM: make sure that all tags and information convert and are tagged properly in NLM tagging. Tables: the tagging structures of CALS and HTML differ, so empty cells, spanned cells, etc. all have to be properly accounted for; otherwise the HTML table will be skewed. MathML: equations often appear on several lines in the PDF. Retain these line breaks where possible. In the SGML part of the conversion we are converting the math as is, so if there are no line breaks in the SGML, the XML does not have them either. Cross-reference range information: these are the textual callouts to structures (references, figures, tables, etc.). Should we retain all of the xrefs within the range or just the end values?
  • #22: These are examples of cross-reference ranges in the source. In the case of citations we include only xrefs of the upper and lower limits of the range. For figures, tables and equations we include <xref> tagging for the items found in the range as well.
  • #23: Since CSS does not support character alignment in tables, we have to align columns using only right, left and center. Note how the middle column is center-aligned because left or right alignment would not work; center is the default value for numeric data.
  • #24: As we converted, unexpected scenarios came up where text was missing in the source. In cases like these we would turn to OSA for guidance on how to handle the missing text. If another copy of the page was available, the page would be replaced. A typical solution was to mark the missing text with a placeholder such as [[missing text]].
  • #25: The requirement for jumping pages is to include all necessary page numbers in the metadata in <fpage>, <lpage> and <page-range>, and then combine the different pages into one PDF, so that one XML file and one PDF (containing all necessary pages) are delivered for an article with jumping pages.
  • #26: Here's an example of a chemical double bond using Unicode 2550 (renders in red). The point of this example is that, since there is no specific Unicode entity for a chemical double bond, it has been mapped to an alternate entity (from a box-drawing set of entities), which in this case fortunately looks rather close to (or the same as) what was needed.
  • #27: A non-standard structure was found; OSA was consulted, and they opted to have it tagged as boxed text.
  • #28: OSA had to determine whether or not they actually wanted to retain this information, since this is filler info and not really part of the article. In the end they decided to retain it in the <back>: in AO as <bio>, and in JOSA titles as <sec> with an attribute value of end-matter.
  • #30: A visual review is done to make sure the text is visually correct. These are 4 specific things to look out for.
  • #31: Mention Sasha. Essential collaboration between DCL and OSA: OSA with subject matter expertise and DCL with technical conversion expertise. The first warning was issued because "(to be published)" was found in a citation, indicating that it has not yet been published, and for unpublished material the citation type attribute should be "other". The second warning was issued because the section designated for tables and figures contains only figures, yet its title is "Figures and Tables". The error was issued because the citation contains multiple <article-title> tags, and only one is allowed.
  • #33: The software finds different variations of the same word and outputs this report to help find extra hyphenation. It also checks hyphenation of all words to see if both
  • #34: This report displays the metadata components in a table format, which is visually easy to review. Discrepancies are easily seen and can be highlighted.