
NLM CONVERSION TO BUILD
“ATOMIC” PHYSICS CONTENT IN AN
AGILE FASHION
JATS-CON, April 2, 2014
OSA – The Optical Society &
DCL – Data Conversion Laboratory, Inc.
1

OSA: scholarly publisher with 19 current and legacy journals, 300+ conference proceedings
2

OSA Governance: Build more-flexible products and services!
How?
• Break 1917-2012 content into “well-polished” atomic pieces following an industry standard
• Develop infrastructure to manage and enrich content, to build new products and services in an agile fashion
• Budget allocated for five-year strategic plan
3

Some evidence of success
With content converted to NLM XML, OSA has developed:
• Enhanced article: Interactive HTML
• Derivative products: ImageBank
• Business Intelligence: New insights into author, topic, funding, and other trends
4

Legacy content (750,000 journal pages)
We expected this . . .
8

This . . . not so much
JOURNAL AS
COMIC BOOK
SCHOOL YEARBOOK
9

1. Most confusing: Articles skipping
pages, sometimes in two directions
10

2. Most shocking: legacy PDF not matching legacy print
[Figure: print page alongside the legacy PDF for the same article]
11

3. Most pervasive: nonscientific content tacked onto research articles
These are not the authors
12

Project specifications: two extremes
1. Hand the project over to the trusted vendor and be done with it
2. Spend up to a year doing heavy content analysis and spec creation
13

Data Conversion Laboratory
• We convert content from any format to any format.
• Expertise with JATS and most industry-standard DTDs and schemas
• Established in 1981; a pioneer in the data conversion industry
• Over a billion pages converted
• Expertise in complex conversion projects: STM publishing, eBooks, technical documents, educational publishing, and library digitization
• Projects range from one book to entire libraries and legacy collections
• Infrastructure for large-scale projects, with automated tracking, quality assurance, and customer reporting for every item
• Industries include publishing, technical societies, aerospace, government, defense, health sciences, libraries, and universities
• Publishes DCLNews, a monthly newsletter on XML and electronic publishing topics, sent to 7,000 subscribers
14

Thoughts on Managing a Large Legacy
Conversion Effort
1) Phased Approach
2) Flexibility and Collaboration
3) Keep it Simple
4) Keep Monitoring Quality
15

1) Phased Approach
Why?
• Varied sources (PDF, XML, SGML)
• Content that changed over time
• Very large input corpus going back to 1917
• Allow for the quick, phased release of new OSA products
Strategy for OSA materials
• Focus on one source type at a time but keep the big picture in mind
• Convert newest material first
• Review and decide on conversion nuances as they come up
16

Source Material Challenges
XML
• OSA Proprietary DTD
• NLM v2.3 DTD
PDF
• PDF Normal
• PDF Image
SGML
• Multiple DTDs
17

2) Build Flexibility and Collaboration into the Conversion Process
• Develop an overall specification, with allowance for change as new scenarios are uncovered
• Software development sprints to incorporate changes
• Close collaboration with OSA to manage new situations affecting completed work and work in process
18

Tools Used to Retain Flexibility
• Client-vendor collaboration for decision making
• Hub and spoke processing
• Handling of conversion anomalies
• Quality assurance reviews
• Learning databanks
19

3) There’s a Lot of Detail – Keep It Simple
• Fitting structures into the existing JATS tagging structure
• CALS to HTML table conversion
• MathML line break retention
• Cross-reference ranges
• Rendering limitations
• Unexpected content scenarios
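
To make the CALS to HTML table conversion item concrete, here is a minimal Python sketch. It is not DCL's conversion software. It assumes a heavily simplified CALS table (tgroup, thead, tbody, row, entry) and ignores colspecs, spans, and alignment. The function name cals_to_html and the sample values are hypothetical.

```python
import xml.etree.ElementTree as ET


def cals_to_html(cals_xml: str) -> str:
    """Convert a simplified CALS table to an HTML table string.

    Assumption: no colspec/spanspec, namest/nameend, or morerows handling.
    """
    cals = ET.fromstring(cals_xml)
    tgroup = cals.find("tgroup")
    if tgroup is None:
        raise ValueError("expected a <tgroup> child")
    html = ET.Element("table")
    # CALS thead rows become <th> cells; tbody rows become <td> cells.
    for section, cell_tag in (("thead", "th"), ("tbody", "td")):
        src = tgroup.find(section)
        if src is None:
            continue
        dst = ET.SubElement(html, section)
        for row in src.findall("row"):
            tr = ET.SubElement(dst, "tr")
            for entry in row.findall("entry"):
                cell = ET.SubElement(tr, cell_tag)
                # Flatten inline markup to plain text for this sketch.
                cell.text = "".join(entry.itertext())
    return ET.tostring(html, encoding="unicode")


# Made-up sample table, for illustration only.
sample = (
    "<table><tgroup cols='2'>"
    "<thead><row><entry>Wavelength (nm)</entry><entry>Power (mW)</entry></row></thead>"
    "<tbody><row><entry>633</entry><entry>1.5</entry></row></tbody>"
    "</tgroup></table>"
)
print(cals_to_html(sample))
```

A production converter would also have to map column spans, row spans, and alignment, which is where most of the detail on this slide tends to hide.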
20

Cross-Reference Ranges
• Bibliographic
• Figure
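
One way to picture the cross-reference range problem: a printed citation such as "4–6" has to resolve to every target in the range, not only the endpoints. The sketch below is an illustration under assumptions, not the project's actual rule. The function name expand_xref_range is hypothetical, and the id pattern (a prefix plus a number, such as b10) is assumed from the reference ids shown in the Schematron messages.

```python
import re
from typing import List


def expand_xref_range(label: str, prefix: str = "b") -> List[str]:
    """Expand a label such as '3-6' or '2,4–6' into a list of target ids."""
    ids: List[str] = []
    for part in re.split(r"\s*,\s*", label.strip()):
        # Accept a hyphen or an en dash between the range endpoints.
        m = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if m:
            start, end = int(m.group(1)), int(m.group(2))
            if end < start:
                raise ValueError(f"descending range: {part!r}")
            ids.extend(f"{prefix}{n}" for n in range(start, end + 1))
        elif re.fullmatch(r"\d+", part):
            ids.append(f"{prefix}{int(part)}")
        else:
            raise ValueError(f"unrecognized reference label: {part!r}")
    return ids


print(expand_xref_range("2,4\u20136"))  # ['b2', 'b4', 'b5', 'b6']
```

The same function could serve figure ranges by passing a different prefix, assuming figure ids follow a similar numbering pattern.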
21

Rendering Limitations
• No CSS support for table character alignment
[Figure: the same table as rendered in PDF and in HTML]
22

Unexpected Content Scenarios
• Missing text - Printed page problems
23

Unexpected Content Scenarios (cont.)
• Jumping pages
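
A jumping-page article can be described as a sequence of contiguous page runs. The following Python sketch groups an article's pages, taken in reading order, into such runs so that forward and backward jumps become visible. The function name page_runs and the page numbers are made up for illustration; the slides do not describe how the project recorded these cases.

```python
def page_runs(pages):
    """Group page numbers, given in reading order, into contiguous runs."""
    runs = []
    for p in pages:
        if runs and p == runs[-1][1] + 1:
            runs[-1][1] = p  # extend the current run by one page
        else:
            runs.append([p, p])  # a jump in either direction starts a new run
    return [tuple(r) for r in runs]


# Hypothetical article that jumps forward, then back.
print(page_runs([792, 793, 801, 802, 795]))
# [(792, 793), (801, 802), (795, 795)]
```

A run that starts below the end of the previous run signals a backward jump, the second direction mentioned earlier in the deck.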
24

• Special characters with no corresponding Unicode code point
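
A small Python sketch of how such characters might be surfaced for review. It assumes they reach the XML as Private Use Area code points or as the replacement character U+FFFD; the slides do not say how they were actually represented, so treat this as one possible convention. The function name find_unmapped_chars is hypothetical.

```python
import unicodedata


def find_unmapped_chars(text: str):
    """Return (offset, code point) pairs for private-use or replacement characters."""
    hits = []
    for offset, ch in enumerate(text):
        # Category 'Co' is Unicode's "private use" general category.
        if unicodedata.category(ch) == "Co" or ch == "\ufffd":
            hits.append((offset, f"U+{ord(ch):04X}"))
    return hits


print(find_unmapped_chars("phase \ue000 shift"))  # [(6, 'U+E000')]
```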
25

• Non-standard Structure
<body>
  <boxed-text>
    <sec>
      <title>Optical Activities in Industry</title>
      <p>66 Summer Street, North Brookfield, Mass. Mr. Cooke welcomes news and comments for this column which should be sent to him at the above address</p>
      <p><inline-graphic xlink:href="ao-8-4-792-i001"/></p>
    </sec>
  </boxed-text>
</body>
26

• White space filler
27

4) Keep Checking Quality – Don’t Get Too Far Ahead
• Visual review
• OSA Schematron
• Reporting stylesheets
• OCR and hyphenation spellchecker software
• QA software
• Learning databanks
28

Visual Review
• Correct entities are used
• Math displays correctly
• Table alignment is accurate
• Images correspond to the source
29

OSA Schematron
• The Schematron includes over 300 checks
Warning:ALERT [LJF:RGCO250]: ref 'b10': unpublished materials must have @publication-type='other' ($unpublished and @publication-type != 'communication' and @publication-type != 'other' / warning) [report]
Warning:ALERT [LJF:JBCO140]: no tables found but title reads 'Figures and Tables' (matches(title, 'Table') and not(exists(table-wrap)) / warning) [report]
ERROR [LJF:RGCO250]: ref 'b14': journal citation contains more than one article-title (count(article-title) > 1) [report]
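
For readers unfamiliar with Schematron, the last rule shown (count(article-title) > 1) can be approximated in a few lines of Python. This is only an analogue of the check, not the OSA Schematron itself. The function name check_article_titles, the sample XML, and the choice of mixed-citation as the citation element are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def check_article_titles(ref_list_xml: str):
    """Report journal citations that contain more than one <article-title>."""
    messages = []
    root = ET.fromstring(ref_list_xml)
    for ref in root.iter("ref"):
        for citation in ref:
            if citation.get("publication-type") != "journal":
                continue
            # Mirrors the XPath test count(article-title) > 1.
            if len(citation.findall(".//article-title")) > 1:
                messages.append(
                    f"ERROR: ref '{ref.get('id')}': journal citation "
                    "contains more than one article-title"
                )
    return messages


# Made-up reference list: b14 has two titles, b15 has one.
sample = (
    "<ref-list>"
    "<ref id='b14'><mixed-citation publication-type='journal'>"
    "<article-title>First title</article-title> "
    "<article-title>Second title</article-title>"
    "</mixed-citation></ref>"
    "<ref id='b15'><mixed-citation publication-type='journal'>"
    "<article-title>Only title</article-title>"
    "</mixed-citation></ref>"
    "</ref-list>"
)
for message in check_article_titles(sample):
    print(message)
```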
30

DCL QA Software
• Highlight any discrepancies between the specifications and the tagging
• Identify suspicious start of a paragraph
• Flag missing external files associated with the XML
• Find missing cross references to specified structures such as Tables and Figures
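
The missing cross-reference check can be sketched as follows. This is not DCL's QA software, whose internals the slides do not describe; it is a hedged illustration that assumes JATS-style fig and table-wrap elements with id attributes and xref elements whose rid may list several ids. The function name uncited_floats and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def uncited_floats(article_xml: str):
    """Return (tag, id) for each figure or table that no <xref> points to."""
    root = ET.fromstring(article_xml)
    cited = set()
    for xref in root.iter("xref"):
        # rid may hold several space-separated ids.
        cited.update((xref.get("rid") or "").split())
    missing = []
    for tag in ("fig", "table-wrap"):
        for element in root.iter(tag):
            if element.get("id") not in cited:
                missing.append((tag, element.get("id")))
    return missing


# Made-up article body: only figure f1 is cited in the text.
sample = (
    "<body><p>See <xref ref-type='fig' rid='f1'>Fig. 1</xref>.</p>"
    "<fig id='f1'/><fig id='f2'/><table-wrap id='t1'/></body>"
)
print(uncited_floats(sample))  # [('fig', 'f2'), ('table-wrap', 't1')]
```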
31

Reporting Stylesheets
• Provides easier review of metadata components for a set of articles
33

OCR Tools
• Modified fonts, used in the proofreading phase, that help distinguish similar-looking characters: “O” vs “0”, “Z” vs “2”, “1” vs “l”
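
The modified fonts help human proofreaders; a software pass can complement them by flagging likely confusions for review. The sketch below is an assumption about how such a pass might look, not a description of the project's OCR or spellchecker tools. The function name suspicious_tokens and the sample sentence are made up.

```python
import re

# Look-alike characters named on the slide: O/0, Z/2, l/1.
LOOKALIKES = set("O0Z2l1")


def suspicious_tokens(text: str):
    """Flag tokens that mix letters and digits and contain a look-alike character."""
    flagged = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        has_alpha = any(c.isalpha() for c in token)
        has_digit = any(c.isdigit() for c in token)
        if has_alpha and has_digit and any(c in LOOKALIKES for c in token):
            flagged.append(token)
    return flagged


print(suspicious_tokens("The 1ens has a f0cal length of 50 mm"))
# ['1ens', 'f0cal']
```

A heuristic like this will also flag legitimate mixed tokens such as chemical formulas, so its output is a review list for a proofreader, not a set of automatic corrections.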
34

Learning Databanks
Ongoing updates made based on feedback and newly determined rules and structures
• Conversion software
• QA software
• Schematron
• Spellchecker and hyphenation software
• Editorial guidelines
• Image creation
35

Conclusions
OSA has nearly completed a large backfile conversion project in close
coordination with DCL. The project, which is based around NLM markup, has
allowed OSA to enhance its publishing platform, build derivative products, and
significantly improve its ability to gather business intelligence from a deep
journal backfile. We offer the following lessons learned:
• With large content projects, plan ahead but prepare to work in an agile fashion
• The content owner should stay engaged throughout the project to align real-time decisions with business aims
• Owner–vendor collaboration, when the right partners are involved, improves morale, attention to detail, and decision-making
36

Scott Dineen
Sr. Director Publishing Production & Technol.
The Optical Society
sdinee@osa.org
Devorah Ashlem
Senior Project Manager
Data Conversion Laboratory
dashlem@dclab.com
37
