1) The Optical Society (OSA) worked with Data Conversion Laboratory (DCL) to convert over 750,000 pages of journal content from 1917-2012 into individual "atomic" pieces using the NLM XML standard.
2) This conversion allowed OSA to develop new products and services more flexibly, such as enhanced HTML articles and business intelligence tools. It provided evidence of success but also uncovered challenges with the legacy content.
3) Managing the large project required a phased approach, flexibility during conversion to handle unexpected issues, keeping the work simple, and continuous quality checks to ensure accuracy. Close collaboration between OSA and DCL was important for decision making.
1. NLM CONVERSION TO BUILD
ATOMIC PHYSICS CONTENT IN AN
AGILE FASHION
JATS-CON, April 2, 2014
OSA The Optical Society &
DCL Data Conversion Laboratory, Inc.
3. How?
Break 1917-2012 content into well-polished
atomic pieces following an industry standard
Develop infrastructure to manage and enrich
content, to build new products and services in an
agile fashion
Budget allocated for five-year strategic plan
OSA Governance: Build more-flexible products and services!
4. Some evidence of success
With content converted to NLM XML, have developed
Enhanced article: Interactive HTML
Derivative products: ImageBank
Business Intelligence: New insights into author,
topic, funding, and other trends
13. Project specifications: two extremes
1. Hand the project over to the trusted vendor and be done with it
2. Spend up to a year doing heavy content analysis and spec creation
14. Data Conversion Laboratory
We convert content from any format to any format.
Expertise with JATS, and most industry standard DTDs and Schemas
Established in 1981; a pioneer in the data conversion industry
Over a billion pages converted
Expertise in complex conversion projects; STM Publishing, eBooks,
Technical documents, Educational Publishing, and Library Digitization.
Projects range from one book to entire libraries and legacy collections
Infrastructure for large-scale projects, with automated tracking, quality
assurance, and customer reporting for every item
Industries include Publishing, Technical Societies, Aerospace,
Government, Defense, Health Sciences, Libraries & Universities
Publish DCLNews, a monthly newsletter devoted to XML and
Electronic Publishing topics going to 7,000 subscribers
15. Thoughts on Managing a Large Legacy
Conversion Effort
1) Phased Approach
2) Flexibility and Collaboration
3) Keep it Simple
4) Keep Monitoring Quality
16. 1) Phased Approach
Why?
Varied sources (PDF, XML, SGML)
Content that changed over time
Very large input corpus going back to 1917
Allow for the quick, phased release of new OSA products
Strategy for OSA materials
Focus on one source type at a time but keep the big picture in mind
Convert newest material first
Review and decide on conversion nuances as they came up
17. XML
OSA Proprietary DTD
NLM v2.3 DTD
PDF
PDF Normal
PDF Image
SGML
Multiple DTDs
Source Material Challenges
18. Develop an overall specification, with allowance for change as
new scenarios are uncovered
Software development sprints to incorporate changes
Close collaboration with OSA to manage new situations
affecting completed work and work in process
2) Build Flexibility and Collaboration into
the Conversion Process
19. Tools Used to Retain Flexibility
Client-Vendor
collaboration for decision
making
Hub and Spoke
processing
Handling of conversion
anomalies
Quality assurance reviews
Learning databanks
20. 3) There's a Lot of Detail: Keep It Simple
Fitting structures into the existing JATS tagging structure
CALS to HTML table conversion
MathML line break retention
Cross-reference ranges
Rendering limitations
Unexpected content scenarios
25. Special characters with no corresponding Unicode
Unexpected Content Scenarios (cont.)
26. <body>
  <boxed-text>
    <sec>
      <title>Optical Activities in Industry</title>
      <p>66 Summer Street, North Brookfield, Mass. Mr. Cooke welcomes news and comments for this column which should be sent to him at the above address</p>
      <p><inline-graphic xlink:href="ao-8-4-792-i001"/></p>
    </sec>
  </boxed-text>
</body>
Non-standard Structure
Unexpected Content Scenarios (cont.)
28. Visual review
OSA Schematron
Reporting stylesheets
OCR and hyphenation spellchecker
software
QA software
Learning databanks
4) Keep Checking Quality
Don't Get Too Far Ahead
29. Correct entities are used
Math displays correctly
Table alignment is accurate
Images correspond to the source
Visual Review
30. The Schematron includes over 300 checks
Warning:ALERT [LJF:RGCO250]: ref 'b10': unpublished materials must have @publication-type='other' ($unpublished and @publication-type != 'communication' and @publication-type != 'other' / warning) [report]
Warning:ALERT [LJF:JBCO140]: no tables found but title reads 'Figures and Tables' (matches(title, 'Table') and not(exists(table-wrap)) / warning) [report]
ERROR [LJF:RGCO250]: ref 'b14': journal citation contains more than one article-title (count(article-title) > 1) [report]
OSA Schematron
31. Highlight any discrepancies between the specifications and the
tagging
Identify suspicious start of a paragraph
Flag missing external files associated with the XML
Find missing cross references to specified structures such as Tables
and Figures
DCL QA Software
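Two of the checks listed on this slide (missing cross references to figures and tables, and missing external files) can be illustrated with a minimal Python sketch. This is not DCL's QA software, which the slides do not show; the function names `uncited_objects` and `missing_files`, the sample ids, and the rule that file extensions are ignored are all assumptions made for illustration.

```python
import os
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/1999/xlink}href"


def uncited_objects(root):
    """Return ids of <fig> and <table-wrap> elements that no <xref> points to."""
    cited = set()
    for xref in root.iter("xref"):
        # @rid may hold several space-separated ids
        cited.update(xref.get("rid", "").split())
    targets = [el.get("id") for tag in ("fig", "table-wrap") for el in root.iter(tag)]
    return [t for t in targets if t and t not in cited]


def missing_files(root, available_names):
    """Return xlink:href values with no matching delivered file.

    Assumption: an href such as 'ao-8-4-792-i001' matches a file of the
    same name with or without an extension.
    """
    stems = {os.path.splitext(name)[0] for name in available_names} | set(available_names)
    hrefs = [el.get(XLINK_HREF) for tag in ("graphic", "inline-graphic") for el in root.iter(tag)]
    return [h for h in hrefs if h and h not in stems]


# Hypothetical sample article fragment
doc = ET.fromstring(
    '<article xmlns:xlink="http://www.w3.org/1999/xlink"><body>'
    '<p>See <xref ref-type="fig" rid="f1">Fig. 1</xref>.</p>'
    '<fig id="f1"><graphic xlink:href="ao-8-4-792-g001"/></fig>'
    '<table-wrap id="t1"/>'
    '<p><inline-graphic xlink:href="ao-8-4-792-i001"/></p>'
    '</body></article>'
)
print(uncited_objects(doc))
print(missing_files(doc, ["ao-8-4-792-g001.tif"]))
```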
33. Provides easier review of metadata components for a set of articles
Reporting Stylesheets
34. Modified versions of the fonts, designed to help distinguish between similar-looking characters (O vs 0, Z vs 2, 1 vs l), are used within the proofreading phase
OCR Tools
35. Ongoing updates made based on
feedback and newly determined rules
and structures
Conversion software
QA software
Schematron
Spellchecker and hyphenation
software
Editorial guidelines
Image creation
Learning Databanks
36. Conclusions
OSA has nearly completed a large backfile conversion project in close
coordination with DCL. The project, which is based around NLM markup, has
allowed OSA to enhance its publishing platform, build derivative products, and
significantly improve its ability to gather business intelligence from a deep
journal backfile. We offer the following lessons learned:
With large content projects, plan ahead but prepare to work in an
agile fashion
The content owner should stay engaged throughout the project to
align real-time decisions with business aims
Owner-vendor collaboration, when the right partners are involved, improves morale, attention to detail, and decision-making
37. Scott Dineen
Sr. Director Publishing Production & Technol.
The Optical Society
sdinee@osa.org
Devorah Ashlem
Senior Project Manager
Data Conversion Laboratory
dashlem@dclab.com
Editor's Notes
#3: OSA is a scholarly publisher with 18 current and legacy journals, 300+ conference proceedings, a member magazine, and a growing number of additional products.
Legacy journals go back to 1917 and have heavy physics content. They also contain many surprises, e.g., foreign language articles, society news, yearbook-like pix, cartoons, doodles.
#6: Citation data modeled in various ways to help researcher size up the article quickly.
#7: All equations captured as MathML; display solution with MathJax running on OSA server.
Works on all major browsers (including on smart phone) with no user download.
MathJax performance will improve with better MathML native support from browsers.
#8: Go to Optics ImageBank to browse all figure images across all content!
#10: Legacy journals served as the membership magazine but also at times like a yearbook.
Particularly challenging were hundreds of photos of people, unrelated to the article, that were tacked onto the ends of articles as filler.
#14: In developing specs, we had to achieve balance between two extremes. The method we settled on was highly collaborative and required both parties to be agile.
The balance of the presentation will describe the process that OSA and DCL used to work collaboratively to convert 750,000 pages of legacy journal content to well-polished NLM XML.
#17: Phased by source material, journal type, newer-to-oldest -> enabled full collection in shortest amount of time.
#18: Source types we dealt with in the different phases of this project.
OSA DTD XML is created as part of the production process and is used for publishing, not done as an afterthought. It's XML first.
NLM 2.3 and SGML material was XML/SGML last.
#19: Because of the wide range of inputs and many years of varied material, we wanted to build flexibility into the process and these are three ways.
Weekly meetings with OSA scheduled with an agenda in addition to close communication anytime something new came up
OSA had to make business decisions about the repercussions regarding going back to make changes
#21: No superset of the JATS DTD was created
This is a list of some of the conversion details that had to be dealt with.
Fitting structures into NLM: make sure that all tags and information convert and are tagged properly in NLM tagging.
Tables: the tagging structures of CALS and HTML differ, so empty cells, spanned cells, etc. all have to be properly accounted for; otherwise the HTML table will be skewed.
MathML: equations often appear on several lines in the PDF. Retain these line breaks where possible. In this SGML part of the conversion we are converting the math as is, so if there are no line breaks in the SGML, then the XML does not have them either.
Cross-reference range information: these are the textual callouts to structures (references, figures, tables, etc.). Should we retain all of the xrefs within the range or just the end values?
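For the CALS-to-HTML table point in this note, the spanned-cell bookkeeping can be sketched in a few lines of Python. This is a minimal illustration and not the conversion software used in the project: the helper name `cals_span_to_html` is hypothetical. It relies only on the standard CALS attributes `namest`, `nameend`, and `morerows` and the HTML attributes `colspan` and `rowspan`.

```python
def cals_span_to_html(entry_attrs, colnames):
    """Map CALS spanning attributes on one <entry> to HTML colspan/rowspan.

    entry_attrs: dict of attributes from a CALS <entry>.
    colnames: ordered column names taken from the <colspec> elements.
    """
    html = {}
    start, end = entry_attrs.get("namest"), entry_attrs.get("nameend")
    if start is not None and end is not None:
        # A horizontal span covers every column from namest to nameend inclusive
        span = colnames.index(end) - colnames.index(start) + 1
        if span < 1:
            raise ValueError("nameend precedes namest")
        if span > 1:
            html["colspan"] = str(span)
    # morerows counts the additional rows, so rowspan is morerows + 1
    more = int(entry_attrs.get("morerows", "0"))
    if more > 0:
        html["rowspan"] = str(more + 1)
    return html


print(cals_span_to_html({"namest": "c1", "nameend": "c3", "morerows": "1"}, ["c1", "c2", "c3"]))
```

Getting either number wrong shifts every later cell in the row, which is the skewing the note warns about.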
#22: These are examples of cross-reference ranges in the source. In the case of citations we include only xrefs of the upper and lower limits of the range. For figures, tables and equations we include <xref> tagging for the items found in the range as well.
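The range policy described in this note can be sketched as follows. It is a minimal Python illustration, not the project's code: the helper `expand_range` and the id scheme (a prefix such as `b` or `f` followed by a number) are assumptions; `bibr` and `fig` are standard `ref-type` values.

```python
def expand_range(ref_type, first, last, prefix):
    """Return the rid values to tag for a callout range such as 'Figs. 2-5'.

    Citation ranges keep only the two end points; figure, table, and
    equation ranges tag every item in the range.
    """
    if first > last:
        raise ValueError("range start must not exceed range end")
    if ref_type == "bibr":
        numbers = [first] if first == last else [first, last]
    else:
        numbers = list(range(first, last + 1))
    return [f"{prefix}{n}" for n in numbers]


print(expand_range("bibr", 3, 7, "b"))  # end points only
print(expand_range("fig", 2, 5, "f"))   # every figure in the range
```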
#23: Since CSS does not support character alignment in tables, we have to align columns using only right, left, and center. Note how the middle column is center-aligned since left/right would not work; center is therefore the default value for numeric data.
#24: As we converted, unexpected scenarios came up where text was missing in the source. In cases like these we would turn to OSA for guidance on how to handle the missing text. If another copy of the page was available, the page would be replaced. Otherwise, a typical solution was to mark the missing text with a placeholder such as [[missing text]].
#25: The requirement for jumping pages is to include all necessary page numbers in the metadata in <fpage>, <lpage>, and <page-range>, and then to combine the different pages into one PDF, so that one XML and one PDF (containing all necessary pages) are delivered for an article with jumping pages.
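As an illustration of the page metadata for an article with jumping pages, here is a minimal Python sketch. The function name `page_metadata` and the exact text format of the `page-range` value (comma-separated runs) are assumptions; the note only says which three elements must carry the page numbers.

```python
def page_metadata(pages):
    """Build fpage, lpage, and page-range values from an article's page numbers."""
    pages = sorted(set(pages))
    if not pages:
        raise ValueError("an article needs at least one page")
    runs = []
    start = prev = pages[0]
    for p in pages[1:]:
        if p != prev + 1:
            # a gap closes the current run of consecutive pages
            runs.append((start, prev))
            start = p
        prev = p
    runs.append((start, prev))
    page_range = ", ".join(str(a) if a == b else f"{a}-{b}" for a, b in runs)
    return {"fpage": str(pages[0]), "lpage": str(pages[-1]), "page-range": page_range}


# Hypothetical article that starts on pages 792-793 and jumps to page 801
print(page_metadata([792, 793, 801]))
```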
#26: Here's an example of a chemical double bond using Unicode U+2550 (renders in red). The point of this example is that, since there is no specific Unicode entity for a chemical double bond, it has been mapped to an alternate entity (from a box-drawing set of entities), which in this case fortunately looks rather close to (or the same as) what was needed.
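A substitution like this is easier to audit if it lives in one explicit lookup table. The sketch below is a hypothetical Python illustration: the table name `FALLBACKS`, its key, and the function `resolve` are assumptions; only the U+2550 mapping comes from the note.

```python
import unicodedata

# Source glyphs with no dedicated Unicode character, mapped to the closest
# available one. Only the double-bond entry is taken from the note above.
FALLBACKS = {"chemical double bond": "\u2550"}


def resolve(glyph_name):
    """Return the substitute character, its code point, and its Unicode name."""
    ch = FALLBACKS[glyph_name]
    return ch, f"U+{ord(ch):04X}", unicodedata.name(ch)


print(resolve("chemical double bond"))
```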
#27: A non-standard structure was found; OSA was consulted, and they opted to have it tagged as boxed text.
#28: OSA had to determine whether or not they actually wanted to retain this information, since it is filler and not really part of the article. In the end they decided to retain it in the <back>: in AO as <bio>, and in the JOSA title as <sec> with an attribute value of end-matter.
#30: A visual review is done to make sure the text is visually correct. These are 4 specific things to look out for.
#31: Mention Sasha
Essential collaboration between DCL and OSA. OSA with subject matter expertise and DCL with technical conversion expertise.
This warning was issued because "(to be published)" was found in a citation, indicating that it has not yet been published, and for unpublished material the citation type attribute should be "other".
This warning was issued because the section designated for tables and figures contains only figures, yet its title is "Figures and Tables".
This error was issued because the citation contains multiple <article-title> tags, and only one is allowed.
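The three checks explained above can be mirrored in plain Python for readers who do not use Schematron. This is a sketch under stated assumptions, not the OSA Schematron: the rule behind `$unpublished` is not given in the slides, so it is approximated here by the phrase "to be published"; the citation element names and the function names `check_ref` and `check_sec` are also assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed citation wrappers; the slides only show @publication-type
CITATION_TAGS = ("mixed-citation", "element-citation")


def check_ref(ref):
    """Return (level, message) pairs for one <ref> element."""
    findings = []
    ref_id = ref.get("id", "?")
    for cit in ref:
        if cit.tag not in CITATION_TAGS:
            continue
        text = "".join(cit.itertext()).lower()
        pub_type = cit.get("publication-type")
        # Like the XPath test, this fires only when the attribute is present
        if ("to be published" in text and pub_type is not None
                and pub_type not in ("communication", "other")):
            findings.append(("warning", f"ref '{ref_id}': unpublished materials must have @publication-type='other'"))
        # count(article-title) > 1, direct children only
        if len(cit.findall("article-title")) > 1:
            findings.append(("error", f"ref '{ref_id}': journal citation contains more than one article-title"))
    return findings


def check_sec(sec):
    """Warn when a section title mentions tables but the section has none."""
    title = sec.findtext("title", default="")
    if "Table" in title and sec.find("table-wrap") is None:
        return [("warning", f"no tables found but title reads '{title}'")]
    return []


# Hypothetical sample input
ref = ET.fromstring(
    "<ref id='b10'><mixed-citation publication-type='journal'>"
    "A. Author, (to be published).</mixed-citation></ref>"
)
print(check_ref(ref))
```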
#33: The software finds different variations of the same word and outputs this report to help find extra hyphenation. It also checks hyphenation of all words to see if both
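The first part of this note (finding variations of the same word) can be illustrated with a short Python sketch. It is not the spellchecker and hyphenation software used in the project; the function name `hyphenation_variants` and the simple word pattern are assumptions.

```python
import re
from collections import defaultdict


def hyphenation_variants(text):
    """Group words that differ only by hyphens, e.g. 'wave-length' vs 'wavelength'."""
    groups = defaultdict(set)
    for word in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text):
        # the key is the word with hyphens removed, compared case-insensitively
        groups[word.replace("-", "").lower()].add(word.lower())
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


print(hyphenation_variants("The wave-length and the wavelength of light"))
```

A report like this only surfaces candidates; a proofreader still decides which form matches the source.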
#34: This report displays the metadata components in a table format, which is visually easy to review. Discrepancies are easily seen and can be highlighted.
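The reporting stylesheets themselves are not shown in the slides. As a rough Python equivalent of the idea, the sketch below pulls a few metadata fields from a set of articles into rows that can be scanned side by side. The field list, the paths, the function names, and the sample values are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical selection of metadata components to compare across articles
FIELDS = {
    "journal": ".//journal-id",
    "volume": ".//article-meta/volume",
    "issue": ".//article-meta/issue",
    "fpage": ".//article-meta/fpage",
    "lpage": ".//article-meta/lpage",
    "title": ".//article-meta/title-group/article-title",
}


def metadata_rows(articles):
    """Return one dict per article; missing components become empty strings."""
    return [
        {name: (root.findtext(path) or "").strip() for name, path in FIELDS.items()}
        for root in articles
    ]


def as_table(rows):
    """Render the rows as tab-separated text with a header line."""
    header = list(FIELDS)
    lines = ["\t".join(header)]
    lines += ["\t".join(row[h] for h in header) for row in rows]
    return "\n".join(lines)


# Made-up sample record; the empty lpage column stands out in the table
sample = ET.fromstring(
    "<article><front><journal-meta><journal-id>ao</journal-id></journal-meta>"
    "<article-meta><title-group><article-title>Optical Activities in Industry"
    "</article-title></title-group><volume>8</volume><issue>4</issue>"
    "<fpage>792</fpage></article-meta></front></article>"
)
print(as_table(metadata_rows([sample])))
```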