Superset Me-Not:
Why the JPTS Is Sufficient if You Use Appropriate Layer Validation
Alexander (Sasha) Schwarzman
American Geophysical Union (AGU)
JATS-Con
November 2, 2010
Summary
We have built a superset of the NLM Journal
Publishing Tag Set and a Schematron validator
to enforce business rules, data types, and
house style.
In retrospect, a JPTS subset, when used in
conjunction with the appropriate layer
validation technology such as Schematron,
could have been sufficient to meet AGU's
needs.
Alexander (Sasha) Schwarzman, Superset Me-Not, JATS-Con, Nov 2, 2010
Contents
- Why we built a JPTS superset
- DTD vs. Schematron
  - Attribute values
  - Number of element occurrences
  - Element position and sequence
  - References
- Lessons learned
Why we built a JPTS superset
- No generic book model, e.g., no book-series-meta for a book, no xi:include for chapters, etc.
- Lack of familiarity with Schematron
- Lack of mature tool support (running SVRL was not a viable option in the Production environment)
- Lack of expertise in using Schematron to validate against external data sources and a relational DB
- JATS v2.3: no Compound Keywords, not all content models parameterized
DTD vs. Schematron:
Attribute values
Requirement: Article type is required and can be one of three types:
a regular article (rga), a correction (cor), or an editorial (edt)
Strict DTD
<!ATTLIST article
article-type
(rga | cor | edt) #REQUIRED >
JPTS
<!ATTLIST article
article-type
CDATA #IMPLIED >
DTD vs. Schematron:
Attribute values (cont'd)
XML instance (contains non-allowed article type)
<article article-type='xxx'/>
Schematron
<rule context="article">
<assert test="@article-type=('rga','cor','edt')">
@article-type '<value-of select='@article-type'/>' not
allowed, must be 'rga', 'cor', or 'edt'</assert></rule>
Schematron message
@article-type 'xxx' not allowed, must be 'rga', 'cor', or
'edt'
DTD vs. Schematron:
Number of element occurrences
Requirement: Acknowledgments, if present, must contain exactly
one paragraph, except for two journals (journal code ja and
rg) where Acknowledgments must contain two paragraphs
Strict DTD
<!ELEMENT ack (p, p?) >
JPTS
<!ELEMENT ack (p*) >
DTD vs. Schematron:
Number of occurrences (cont'd)
XML instance (wrong number of paragraphs)
<article>
...
<journal-id>jb</journal-id>
...
<ack>
<p>Blah</p>
<p>Blah-blah</p>
</ack>
</article>
DTD vs. Schematron:
Number of occurrences (cont'd)
Schematron
<rule context="ack[ancestor::*/journal-id=('ja','rg')]">
<assert test="count(p) eq 2">
'<name/>' in '<value-of select="ancestor::*/journal-id"/>'
must contain exactly two paragraphs</assert></rule>
<rule context="ack">
<assert test="count(p) eq 1">
'<name/>' in '<value-of select="ancestor::*/journal-id"/>'
must contain only one paragraph</assert></rule>
DTD vs. Schematron:
Number of occurrences (cont'd)
Schematron message
'ack' in 'jb' must contain only one paragraph
DTD vs. Schematron:
Element position & sequence
Requirement: If a journal uses subject grouping (a ToC category,
a disciplinary subset) and an article belongs to a special
collection (a special section, a theme), then subject grouping
metadata must precede special collection metadata
Strict DTD
<!ELEMENT article-categories
(subject-group*,
special-collection?) >
JPTS
<!ELEMENT article-categories
(subj-group*) >
DTD vs. Schematron:
Element position & sequence (cont'd)
XML instance (wrong sequence of subject groups)
<article-categories>
<subj-group subj-group-type="special-section">
<subject content-type="EARLYWARN1">New Methods and
Applications of Earthquake Early Warning</subject>
</subj-group>
<subj-group subj-group-type="toc-category">
<subject content-type="SDE">Solid Earth</subject>
</subj-group>
</article-categories>
DTD vs. Schematron:
Element position & sequence (cont'd)
Schematron
<rule context="article-categories/subj-group[@subj-group-type=('special-section','theme')]">
<assert test="not(following-sibling::subj-group[@subj-group-type=('toc-category','subset')])">
<name/>/@subj-group-type='<value-of select='@subj-group-type'/>' must appear
after a ToC Category or a Subset when either is present</assert></rule>
Schematron message
subj-group/@subj-group-type='special-section' must appear
after a ToC Category or a Subset when either is present
DTD vs. Schematron:
References
Validating references is a challenge:
- Variety vs. the need to enforce editorial style
Strict DTD:
- Fixed element order, no mixed content
- Punctuation, spacing, face markup: generated on output
JPTS:
- Lots of elements, any order, mixed content
- Punctuation, spacing, face markup: included in the content
DTD vs. Schematron:
References (cont'd)
Strict DTD
<!ELEMENT book-standalone-citation
((person-group | string-name),
year,
source,
edition?,
(person-group | string-name)?,
size?,
elocation-id?,
publisher-name,
publisher-loc) >
<!ATTLIST book-standalone-citation
id ID #REQUIRED >
DTD vs. Schematron:
References (cont'd)
JPTS
<!ELEMENT mixed-citation
(#PCDATA | person-group | string-name |
year | source | edition | size |
elocation-id | publisher-name |
publisher-loc | ... | ...)* >
<!ATTLIST mixed-citation
id ID #IMPLIED
publication-type CDATA #IMPLIED >
DTD vs. Schematron:
References (cont'd)
Example:
Mood, A. M., and F. A. Graybill (1963),
Introduction to the Theory of Statistics, 2nd ed.,
295 pp., McGraw-Hill, New York.
DTD vs. Schematron:
References (cont'd)
XML instance (strict DTD)
<book-standalone-citation id="mood63">
<person-group person-group-type="author">
<name><surname>Mood</surname>
<given-names>A. M.</given-names></name>
<name><surname>Graybill</surname>
<given-names>F. A.</given-names></name>
</person-group>
<year>1963</year>
<source>Introduction to the Theory of Statistics</source>
<edition>2nd</edition>
<size units="page">295 pp</size>
<publisher-name>McGraw-Hill</publisher-name>
<publisher-loc>New York</publisher-loc>
</book-standalone-citation>
DTD vs. Schematron:
References (cont'd)
XML instance (JPTS)
<mixed-citation publication-type="book-standalone">
<string-name>
<surname>Mood</surname>, <given-names>A. M.</given-names>
</string-name>, and <string-name>
<given-names>F. A.</given-names> <surname>Graybill</surname>
</string-name>
(<year>1963</year>),
<source><italic>Introduction to the
Theory of Statistics</italic></source>,
<edition>2</edition>nd ed.,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>
DTD vs. Schematron:
References (cont'd)
Before we proceed, please note:
- required elements
- edition, if present, follows source
- optional elements between source and publisher-name:
<!ELEMENT book-standalone-citation
((person-group | string-name),
year,
source,
edition?,
(person-group | string-name)?,
size?,
elocation-id?,
publisher-name,
publisher-loc) >
DTD vs. Schematron:
References (cont'd)
- Schematron can check that all required elements are present:
<rule context="mixed-citation[@publication-type='book-standalone']">
<assert test="(person-group | string-name) and year
and source and publisher-name
and publisher-loc">
required element missing</assert></rule>
- and that the elements are in the correct sequence:
DTD vs. Schematron:
References (cont'd)
XML instance (JPTS) (edition is in the wrong place)
<mixed-citation publication-type="book-standalone">
<string-name>
<surname>Mood</surname>, <given-names>A. M.</given-names>
</string-name>, and <string-name>
<given-names>F. A.</given-names><surname>Graybill</surname>
</string-name>
(<year>1963</year>),
<edition>2</edition>nd ed.,
<source><italic>Introduction to the Theory </italic></source>,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.
</mixed-citation>
DTD vs. Schematron:
References (cont'd)
This Schematron uses the positional predicate [1] to check that year is
immediately followed by source:
<rule context="mixed-citation[@publication-type=
'book-standalone']/year">
<assert test="following-sibling::*[1]/self::source">
'<name/>' must be immediately followed by 'source', not by '<value-of
select='name(following-sibling::*[1])'/>'
</assert></rule>
Schematron message
'year' must be immediately followed by 'source', not by 'edition'
DTD vs. Schematron:
References (cont'd)
But how to check the sequence of required elements when there might
be optional elements interspersed between them?
This Schematron checks that required publisher-name is preceded by
required source, regardless of any optional elements that may
occur in-between:
<rule context="mixed-citation[@publication-type=
'book-standalone']/publisher-name">
<assert test="preceding-sibling::source">
'<name/>' must be preceded by 'source'</assert></rule>
DTD vs. Schematron:
References (cont'd)
- Rick Jelliffe's approach combines the flexibility of JPTS with the benefits of a DTD-like fixed element order:
  - The content of each element is rewritten as a string of its child element names
  - The content model is represented as a regular expression
  - Schematron checks the string of names against the regex
  - Schematron generates an error message if the content does not match the model
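The slides do not show how this approach is implemented. As a rough illustration only, here is a minimal sketch in Python, used as a stand-in for the Schematron and XSLT 2.0 machinery the approach actually relies on. The regular expression `BOOK_STANDALONE` is hand-written from the strict book-standalone-citation content model shown earlier, and the names `child_names`, `check`, `GOOD`, and `BAD` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hand-written regex for the book-standalone model (an assumption, not AGU's
# actual pattern). Every element name is followed by exactly one space.
BOOK_STANDALONE = re.compile(
    r"(string-name |person-group )"
    r"year source (edition )?"
    r"(string-name |person-group )?"
    r"(size )?(elocation-id )?"
    r"publisher-name publisher-loc "
)


def child_names(element):
    """Rewrite an element's content as a string of its child element names."""
    return "".join(child.tag + " " for child in element)


def check(citation_xml):
    """Return None if the citation matches the model, else an error message."""
    citation = ET.fromstring(citation_xml)
    names = child_names(citation)
    if BOOK_STANDALONE.fullmatch(names):
        return None
    return "'%s' does not match the book-standalone model" % names.strip()


# Mixed content: text and punctuation sit between the child elements.
GOOD = """<mixed-citation publication-type="book-standalone">
<person-group><string-name>A. M. Mood</string-name> and
<string-name>F. A. Graybill</string-name></person-group>
(<year>1963</year>),
<source>Introduction to the Theory of Statistics</source>,
<edition>2</edition>nd ed., <size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.</mixed-citation>"""

# Same citation, but edition is placed before source.
BAD = """<mixed-citation publication-type="book-standalone">
<person-group><string-name>A. M. Mood</string-name> and
<string-name>F. A. Graybill</string-name></person-group>
(<year>1963</year>),
<edition>2</edition>nd ed.,
<source>Introduction to the Theory of Statistics</source>,
<size units="page">295</size> pp.,
<publisher-name>McGraw-Hill</publisher-name>,
<publisher-loc>New York</publisher-loc>.</mixed-citation>"""
```

In this sketch `check(GOOD)` returns `None`, while `check(BAD)` returns a message because `edition` precedes `source`. Only the element names are compared, so the text and punctuation between them are ignored and mixed content stays permitted.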
DTD vs. Schematron:
References (cont'd)
An XML file, e.g., citation-models.xml, specifies structured citation
models:
...
<model publication-type="book-standalone">
((string-name | person-group),
year,
source,
edition?,
(string-name | person-group)?,
size?,
elocation-id?,
publisher-name,
publisher-loc)
</model>
...
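A model written in this DTD-like syntax has to be turned into a regular expression before a string of child element names can be checked against it. The slides do not show that step. The following is a minimal Python sketch under stated assumptions: `model_to_regex` is a hypothetical helper, each name in the string being checked is followed by one space, and `edition` is treated as optional, as in the strict DTD. In the setup the slides describe, this translation would be done in XSLT 2.0 rather than Python.

```python
import re

# Tokens of a DTD-style content model: element names and the operators
# ( ) | , ? * +   (whitespace between tokens is skipped by findall).
TOKEN = re.compile(r"[A-Za-z][\w.-]*|[()|,?*+]")


def model_to_regex(model):
    """Translate a DTD-style content model into a regex over 'name ' strings."""
    parts = []
    for token in TOKEN.findall(model):
        if token == ",":
            continue                      # a sequence is plain concatenation
        elif token == "(":
            parts.append("(?:")           # non-capturing group
        elif token in (")", "|", "?", "*", "+"):
            parts.append(token)           # same meaning in regex syntax
        else:
            # An element name, always followed by one space in the name string.
            parts.append("(?:" + re.escape(token) + " )")
    return "".join(parts)


# Content model copied from the strict DTD for book-standalone citations.
MODEL = """
((string-name | person-group),
 year,
 source,
 edition?,
 (string-name | person-group)?,
 size?,
 elocation-id?,
 publisher-name,
 publisher-loc)
"""

PATTERN = re.compile(model_to_regex(MODEL))


def matches_model(names):
    """True if a space-terminated string of child names fits the model."""
    return PATTERN.fullmatch(names) is not None
```

For example, `matches_model("string-name year source publisher-name publisher-loc ")` is true, while a string in which `edition` comes before `source` is rejected. Keeping the models in a separate file means the editorial rules can change without touching the checking code.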
DTD vs. Schematron:
References (cont'd)
- Advantages:
  - XML is still DTD-valid
  - Mixed content is permitted
  - Type-sensitive handling of references is possible
- Caveat: XSLT 2.0 is required!
Lessons learned
- AGU Tag Set + Schematron (200+ checks)
  - Ensures data quality
  - Ensures markup integrity
  - Provides control over production processes
  - Enforces business rules, data types, and house style
- AGU Tag Set is a superset of JPTS
  - Based on JPTS
  - Uses the same modularization principles
  - Can be easily mapped to JPTS
- BUT: were we to do this again, we would build a JPTS subset and a Schematron for it
Lessons learned (cont'd)
- Appropriate layer validation: advantages
  - Even the most "Prussian" DTD can't enforce all business rules, data types, and house style
  - Rules-based checking is needed anyway
  - May as well use the "Californian" JPTS (de facto industry standard) adopted by publishers, conversion & composition vendors, archives, etc.
  - Can use tools developed for JATS: Preview XSLT stylesheets, EPUBS conversion processes, etc.
- Paradigm shift: the crux of validation shifts from the XML parser to the Schematron engine
Lessons learned (cont'd)
- This shift is not without costs:
  - Content may be valid to JPTS but make no sense
  - Dependency on Schematron for semantic integrity
  - Preserving each Schematron release and adding version info to the content's metadata (?)
  - Constraints on business partners: must be Schematron-capable and have tools
- Schematron does not fix problems; people do. Processes and procedures must be well-defined
Lessons learned (cont'd)
- Writing a simple Schematron is easy; building a complex and efficient one is not:
  - Elicit, document, convey, and clarify the Requirements
  - Ensure Schematron fits into your workflow
  - Modularize Schematron
  - Ensure that individual Schematron rules aren't in conflict
  - Optimize Schematron performance
  - Employ XSLT 2.0
  - Test, test, test
  - Cultivate Schematron & XSLT 2.0 expertise in-house
Conclusion
- What about content that is not like a journal article, e.g., generic (non-NCBI) books and their parts/chapters?
- When this deficiency is addressed, the NLM Archiving and Interchange Tag Suite could truly say:
Superset Me-Not!