The document discusses why the author's organization initially built a superset of the NLM Journal Publishing Tag Set (JPTS) rather than using JPTS directly. In retrospect, the author argues that a JPTS subset combined with Schematron validation would have been sufficient. Schematron is more flexible than DTDs and can enforce business rules, data types, and style guidelines that a DTD cannot. While requiring more expertise to implement, Schematron validation has advantages over building proprietary tag sets.
Schwarzman-JATS-Con-slides
1. Superset MeNot:
Why the JPTS Is Sufficient if You Use Appropriate Layer Validation
Alexander (Sasha) Schwarzman
American Geophysical Union (AGU)
JATS-Con
November 2, 2010
2. Summary
We have built a superset of the NLM Journal
Publishing Tag Set and a Schematron validator
to enforce business rules, data types, and
house style.
In retrospect, a JPTS subset, when used in
conjunction with the appropriate layer
validation technology, such as Schematron,
could have been sufficient to meet AGU's
needs.
Alexander (Sasha) Schwarzman 2Superset MeNot JATS-Con Nov 2, 2010
3. Contents
Why we built a JPTS superset
DTD vs. Schematron
Attribute values
Number of element occurrences
Element position and sequence
References
Lessons learned
4. Why we built a JPTS superset
No generic book model, e.g., no book-series-meta
for a book, no xi:include for chapters, etc.
Lack of familiarity with Schematron
Lack of mature tool support (running SVRL not a
viable option in Production environment)
Lack of expertise on using Schematron to
validate against external data sources and
relational DB
JATS v2.3: no Compound Keywords, not all
content models parameterized
5. DTD vs. Schematron:
Attribute values
Requirement: Article type is required and can be one of three types:
a regular article (rga), a correction (cor), or an editorial (edt)
Strict DTD
<!ATTLIST article
article-type
(rga | cor | edt) #REQUIRED >
JPTS
<!ATTLIST article
article-type
CDATA #IMPLIED >
6. DTD vs. Schematron:
Attribute values (contd)
XML instance (contains non-allowed article type)
<article article-type='xxx'/>
Schematron
<rule context="article">
<assert test="@article-type=('rga','cor','edt')">
@article-type '<value-of select='@article-type'/>' not
allowed, must be 'rga', 'cor', or 'edt'</assert></rule>
Schematron message
@article-type 'xxx' not allowed, must be 'rga', 'cor', or
'edt'
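For readers without a Schematron engine at hand, the logic of this rule can be sketched in plain Python. This is only an illustration of the check, not the AGU validator: the function name `check_article_type` is hypothetical, and nothing beyond the standard library is assumed.

```python
import xml.etree.ElementTree as ET

# Allowed values, taken from the requirement stated on the slide.
ALLOWED_ARTICLE_TYPES = {"rga", "cor", "edt"}

def check_article_type(xml_text):
    """Return a list of error messages; an empty list means the check passed."""
    root = ET.fromstring(xml_text)
    errors = []
    # iter() also visits the root element itself when its tag matches.
    for article in root.iter("article"):
        # A missing attribute is treated as an empty value and therefore fails.
        value = article.get("article-type", "")
        if value not in ALLOWED_ARTICLE_TYPES:
            errors.append(
                f"@article-type '{value}' not allowed, "
                "must be 'rga', 'cor', or 'edt'"
            )
    return errors

print(check_article_type("<article article-type='xxx'/>"))
```

As with the Schematron rule, the document stays valid against the looser JPTS declaration (CDATA, implied); the restriction lives entirely in the rules layer.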
7. DTD vs. Schematron:
Number of element occurrences
Requirement: Acknowledgments, if present, must contain exactly
one paragraph, except for two journals (journal code ja and
rg) where Acknowledgments must contain two paragraphs
Strict DTD
<!ELEMENT ack (p, p?) >
JPTS
<!ELEMENT ack (p*) >
8. DTD vs. Schematron:
Number of occurrences (contd)
XML instance (wrong number of paragraphs)
<article>
...
<journal-id>jb</journal-id>
...
<ack>
<p>Blah</p>
<p>Blah-blah</p>
</ack>
</article>
9. DTD vs. Schematron:
Number of occurrences (contd)
Schematron
<rule context="ack[ancestor::*/journal-id=('ja','rg')]">
<assert test="count(p) eq 2">
'<name/>' in '<value-of select="ancestor::*/journal-id"/>'
must contain exactly two paragraphs</assert></rule>
<rule context="ack">
<assert test="count(p) eq 1">
'<name/>' in '<value-of select="ancestor::*/journal-id"/>'
must contain only one paragraph</assert></rule>
10. DTD vs. Schematron:
Number of occurrences (contd)
Schematron message
'ack' in 'jb' must contain only one paragraph
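The journal-dependent count can be mimicked in a few lines of Python to show why a DTD alone cannot express it: the expected number of paragraphs depends on a value found elsewhere in the document. This is a sketch under assumptions, not the production rule set; `check_ack` and `TWO_PARAGRAPH_JOURNALS` are hypothetical names, and the simplified instance layout from the slides (a `journal-id` somewhere inside `article`) is assumed.

```python
import xml.etree.ElementTree as ET

# Journals whose Acknowledgments must contain two paragraphs (from the slide).
TWO_PARAGRAPH_JOURNALS = {"ja", "rg"}

def check_ack(xml_text):
    """Return error messages for ack elements with the wrong paragraph count."""
    root = ET.fromstring(xml_text)
    journal_id = root.findtext(".//journal-id", default="")
    expected = 2 if journal_id in TWO_PARAGRAPH_JOURNALS else 1
    wording = "exactly two paragraphs" if expected == 2 else "only one paragraph"
    errors = []
    for ack in root.iter("ack"):
        # Count only direct p children, as count(p) does in the Schematron rule.
        if len(ack.findall("p")) != expected:
            errors.append(f"'ack' in '{journal_id}' must contain {wording}")
    return errors

sample = (
    "<article><journal-id>jb</journal-id>"
    "<ack><p>Blah</p><p>Blah-blah</p></ack></article>"
)
print(check_ack(sample))
```

The if/else on the journal code plays the role of rule ordering in Schematron, where the more specific `ack[...]` rule is listed first so that the general `ack` rule only fires for the remaining journals.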
11. DTD vs. Schematron:
Element position & sequence
Requirement: If a journal uses subject grouping (a ToC category,
a disciplinary subset) and an article belongs to a special
collection (a special section, a theme), then subject grouping
metadata must precede special collection metadata
Strict DTD
<!ELEMENT article-categories
(subject-group*,
special-collection?) >
JPTS
<!ELEMENT article-categories
(subj-group*) >
12. DTD vs. Schematron:
Element position & sequence (contd)
XML instance (wrong sequence of subject groups)
<article-categories>
<subj-group subj-group-type="special-section">
<subject content-type="EARLYWARN1">New Methods and
Applications of Earthquake Early Warning</subject>
</subj-group>
<subj-group subj-group-type="toc-category">
<subject content-type="SDE">Solid Earth</subject>
</subj-group>
</article-categories>
13. DTD vs. Schematron:
Element position & sequence (contd)
Schematron
<rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">
<assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])">
<name/>/@subj-group-type='<value-of select='@subj-group-type'/>' must appear after a ToC Category or a Subset
when either is present</assert></rule>
Schematron message
subj-group/@subj-group-type='special-section' must appear
after a ToC Category or a Subset when either is present
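The ordering rule amounts to "no subject-grouping sibling may follow a special-collection group". A rough Python equivalent is given here purely to make that logic explicit; `check_subj_group_order` and the two constant names are hypothetical, and the attribute values are the ones used in the Schematron rule on this slide.

```python
import xml.etree.ElementTree as ET

# Attribute values as used in the Schematron rule.
COLLECTION_TYPES = {"special-section", "theme"}
GROUPING_TYPES = {"toc-category", "subset"}

def check_subj_group_order(xml_text):
    """Report special-collection groups that precede a subject-grouping group."""
    root = ET.fromstring(xml_text)
    errors = []
    for categories in root.iter("article-categories"):
        groups = categories.findall("subj-group")  # direct children, in order
        for index, group in enumerate(groups):
            group_type = group.get("subj-group-type", "")
            if group_type not in COLLECTION_TYPES:
                continue
            # Equivalent of following-sibling::subj-group[...] in the rule.
            later_types = {g.get("subj-group-type") for g in groups[index + 1:]}
            if later_types & GROUPING_TYPES:
                errors.append(
                    f"subj-group/@subj-group-type='{group_type}' must appear "
                    "after a ToC Category or a Subset when either is present"
                )
    return errors

sample = (
    "<article-categories>"
    "<subj-group subj-group-type='special-section'><subject>A</subject></subj-group>"
    "<subj-group subj-group-type='toc-category'><subject>B</subject></subj-group>"
    "</article-categories>"
)
print(check_subj_group_order(sample))
```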
14. DTD vs. Schematron:
References
Validating references is a challenge:
Variety vs. the need to enforce editorial style
Strict DTD:
Fixed element order, no mixed content
Punctuation, spacing, face markup on output
JPTS:
Lots of elements, any order, mixed content
Punctuation, spacing, face markup included
15. DTD vs. Schematron:
References (contd)
Strict DTD
<!ELEMENT book-standalone-citation
((person-group | string-name),
year,
source,
edition?,
(person-group | string-name)?,
size?,
elocation-id?,
publisher-name,
publisher-loc) >
<!ATTLIST book-standalone-citation
id ID #REQUIRED >
16. DTD vs. Schematron:
References (contd)
JPTS
<!ELEMENT mixed-citation
(#PCDATA | person-group | string-name |
year | source | edition | size |
elocation-id | publisher-name |
publisher-loc | ... | ...)* >
<!ATTLIST mixed-citation
id ID #IMPLIED
publication-type CDATA #IMPLIED >
17. DTD vs. Schematron:
References (contd)
Example:
Mood, A. M., and F. A. Graybill (1963),
Introduction to the Theory of Statistics, 2nd ed.,
295 pp., McGraw-Hill, New York.
18. DTD vs. Schematron:
References (contd)
XML instance (strict DTD)
<book-standalone-citation id="mood63">
<person-group person-group-type="author">
<name><surname>Mood</surname>
<given-names>A. M.</given-names></name>
<name><surname>Graybill</surname>
<given-names>F. A.</given-names></name>
</person-group>
<year>1963</year>
<source>Introduction to the Theory of Statistics</source>
<edition>2nd</edition>
<size units="page">295 pp</size>
<publisher-name>McGraw-Hill</publisher-name>
<publisher-loc>New York</publisher-loc>
</book-standalone-citation>
19. DTD vs. Schematron:
References (contd)
XML instance (JPTS)
<mixed-citation publication-type="book-standalone">
<string-name>
<surname>Mood</surname>, <given-names>A. M.</given-names>
</string-name>, and <string-name>
<given-names>F. A.</given-names> <surname>Graybill</surname>
</string-name>
(<year>1963</year>),
<source><italic>Introduction to the
Theory of Statistics</italic></source>,
<edition>2</edition>nd ed.,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>
20. DTD vs. Schematron:
References (contd)
Before we proceed, please note:
- required elements
- edition, if present, follows source
- optional elements between source and publisher-name:
<!ELEMENT book-standalone-citation
((person-group | string-name),
year,
source,
edition?,
(person-group | string-name)?,
size?,
elocation-id?,
publisher-name,
publisher-loc) >
21. DTD vs. Schematron:
References (contd)
Schematron can check that all required elements
are present:
<rule context="mixed-citation[@publication-type='book-standalone']">
<assert test="(person-group | string-name) and year
and source and publisher-name
and publisher-loc">
required element missing</assert></rule>
& that the elements are in the correct sequence:
22. DTD vs. Schematron:
References (contd)
XML instance (JPTS) (edition is in the wrong place)
<mixed-citation publication-type="book-standalone">
<string-name>
<surname>Mood</surname>, <given-names>A. M.</given-names>
</string-name>, and <string-name>
<given-names>F. A.</given-names><surname>Graybill</surname>
</string-name>
(<year>1963</year>),
<edition>2</edition>nd ed.,
<source><italic>Introduction to the Theory </italic></source>,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>
23. DTD vs. Schematron:
References (contd)
This Schematron uses the positional predicate [1] to check that year is
immediately followed by source:
<rule context="mixed-citation[@publication-type=
'book-standalone']/year">
<assert test="following-sibling::*[1]/self::source">
'<name/>' must be immediately followed by 'source', not by '<value-of
select='name(following-sibling::*[1])'/>'
</assert></rule>
Schematron message
'year' must be immediately followed by 'source', not by 'edition'
24. DTD vs. Schematron:
References (contd)
But how can we check the sequence of required elements when optional
elements might be interspersed between them?
This Schematron checks that the required publisher-name is preceded by
the required source, regardless of any optional elements that may
occur in between:
<rule context="mixed-citation[@publication-type=
'book-standalone']/publisher-name">
<assert test="preceding-sibling::source">
'<name/>' must be preceded by 'source'</assert></rule>
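The two sequence checks on the preceding slides (an immediate successor, and a predecessor at any distance) can be contrasted in a short Python sketch. It assumes the citation's child elements are inspected in document order and ignores the text between them, as the XPath sibling axes with a name test do; `check_book_citation_order` is a hypothetical name and this is not the AGU implementation.

```python
import xml.etree.ElementTree as ET

def check_book_citation_order(xml_text):
    """Apply both sequence checks to every book-standalone mixed-citation."""
    root = ET.fromstring(xml_text)
    errors = []
    for citation in root.iter("mixed-citation"):
        if citation.get("publication-type") != "book-standalone":
            continue
        names = [child.tag for child in citation]  # child elements only
        for position, name in enumerate(names):
            if name == "year":
                # Counterpart of following-sibling::*[1]/self::source
                follower = names[position + 1] if position + 1 < len(names) else ""
                if follower != "source":
                    errors.append(
                        "'year' must be immediately followed by 'source', "
                        f"not by '{follower}'"
                    )
            if name == "publisher-name" and "source" not in names[:position]:
                # Counterpart of preceding-sibling::source (any distance back)
                errors.append("'publisher-name' must be preceded by 'source'")
    return errors

sample = (
    "<mixed-citation publication-type='book-standalone'>"
    "<string-name/> (<year>1963</year>), <edition>2</edition>nd ed., "
    "<source>Title</source>, <publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>.</mixed-citation>"
)
print(check_book_citation_order(sample))
```

The first check is strict about adjacency, while the second tolerates any number of optional elements in between, which is why the misplaced edition trips only the first one.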
25. DTD vs. Schematron:
References (contd)
Rick Jelliffe's approach combines the flexibility of JPTS
with the benefits of a DTD-like fixed element order:
Each element's content is rewritten as a string of its child element
names
The content model is represented as a regular expression
Schematron checks the string of names against the regex
Schematron generates an error message if the content
does not match the model
26. DTD vs. Schematron:
References (contd)
An XML file, e.g., citation-models.xml, specifies structured citation
models:
...
<model publication-type="book-standalone">
((string-name | person-group),
year,
source,
edition?,
(string-name | person-group)?,
size?,
elocation-id?,
publisher-name,
publisher-loc)
</model>
...
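The idea of matching a string of element names against a regular expression can be tried out in Python. The sketch below is an assumption-laden illustration, not Rick Jelliffe's code and not the AGU Schematron: it reads the model text directly rather than from citation-models.xml, treats edition as optional (as in the strict DTD), terminates each name with a space, and uses the hypothetical helper names `model_to_regex`, `child_names`, and `check_citation`.

```python
import re
import xml.etree.ElementTree as ET

# Content model in the style of the slide; edition is assumed optional here.
BOOK_STANDALONE_MODEL = """
((string-name | person-group), year, source, edition?,
 (string-name | person-group)?, size?, elocation-id?,
 publisher-name, publisher-loc)
"""

def model_to_regex(model):
    """Turn a DTD-style content model into a regex over space-terminated names."""
    compact = re.sub(r"\s+", "", model)  # drop layout whitespace
    # Wrap every element name in a group that also consumes its trailing space.
    with_names = re.sub(
        r"[\w.-]+", lambda m: "(?:" + re.escape(m.group(0)) + " )", compact
    )
    # A comma means "followed by", which is plain concatenation in a regex.
    return re.compile(with_names.replace(",", ""))

def child_names(element):
    """Rewrite an element's content as a string of its child element names."""
    return "".join(child.tag + " " for child in element)

def check_citation(xml_text, model=BOOK_STANDALONE_MODEL):
    """Return an error message list; empty when the children match the model."""
    citation = ET.fromstring(xml_text)
    names = child_names(citation)
    if model_to_regex(model).fullmatch(names):
        return []
    return [f"content '{names.strip()}' does not match the model"]

sample = (
    "<mixed-citation publication-type='book-standalone'>"
    "<person-group/> (<year>1963</year>), <edition>2</edition>nd ed., "
    "<source>Title</source>, <publisher-name>McGraw-Hill</publisher-name>, "
    "<publisher-loc>New York</publisher-loc>.</mixed-citation>"
)
print(check_citation(sample))
```

Because only element names are compared, punctuation and other mixed content pass through untouched. Note that the model as written admits a single string-name or person-group at the start, so a citation marked up with two separate string-name elements would be reported unless the model is relaxed (for example to string-name+).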
27. DTD vs. Schematron:
References (contd)
Advantages:
XML is still DTD-valid
Mixed content is permitted
Type-sensitive handling of references is possible
Caveat: requires XSLT 2.0!
28. Lessons learned
AGU Tag Set + Schematron (200+ checks)
Ensures data quality
Ensures markup integrity
Provides control over production processes
Enforces business rules, data types, and house style
AGU Tag Set is a superset of JPTS
Based on JPTS
Uses the same modularization principles
Can be easily mapped to JPTS
BUT: were we to do this again, we would build
a JPTS subset and a Schematron for it
29. Lessons learned (contd)
Appropriate layer validation: advantages
Even the most Prussian DTD can't enforce all
business rules, data types, and house style
Rules-based checking needed anyway
May as well use Californian JPTS (de facto
industry standard) adopted by publishers,
conversion & composition vendors, archives, etc.
Can use tools developed for JATS: Preview XSLT
stylesheets, EPUBS conversion processes, etc.
Paradigm shift: the crux of validation shifts
from XML parser to Schematron engine
30. Lessons learned (contd)
This shift is not without costs:
Content may be valid to JPTS but make no sense
Dependency on Schematron for semantic integrity
Preserving each Schematron release and adding
version info to the content's metadata (?)
Constraints on business partners: must be
Schematron-capable and have tools
Schematron does not fix problems; people do.
Processes and procedures must be well-defined
31. Lessons learned (contd)
Writing a simple Schematron is easy;
building a complex and efficient one is not:
Elicit, document, convey, and clarify the Requirements
Ensure Schematron fits into your workflow
Modularize Schematron
Ensure that individual Schematron rules aren't in conflict
Optimize Schematron performance
Employ XSLT 2.0
Test, test, test
Cultivate Schematron & XSLT 2.0 expertise in-house
32. Conclusion
What about content that is not like a journal
article, e.g., generic (non-NCBI) books and their
parts/chapters?
When this deficiency is addressed, the NLM
Archiving and Interchange Tag Suite could truly
say:
Superset MeNot!