My talk at MarkupUK (2018-06-10).
This paper studies machine-readable notations for describing subsets of Unicode or ISO/IEC 10646. Unicode regular expressions can describe any subset, but they have performance problems for huge subsets and cannot directly capture subsets defined in terms of other subsets. Meanwhile, the upcoming second edition of ISO/IEC 19757-7 Character Repertoire Description Language (CREPDL) overcomes these problems by providing references to well-known subsets and external CREPDL scripts.
CREPDL: Protect Yourself from the Proliferation of Unicode Characters
4. • Mistakenly introduced characters
• Separation that is good enough for most people is not good enough for somebody (see CJK compatibility ideographs).
• Japanese users do not necessarily need what Chinese users need, and vice versa.
5. • Implementations may support subsets.
• No subsets are defined.
• No mechanisms for describing subsets are defined.
• However, it is true that Unicode regular expressions can be used for representing subsets.
6. • Implementations may support subsets.
• Taxonomy of subsets:
  • implementation-defined lists of code points,
  • standardized collections as defined in Annex A,
  • a combination of the two.
• Annex A uses multiple notations without formal definitions.
7. • LATIN-1 SUPPLEMENT (collection 2) is the range 00A0-00FF.
• MULTILINGUAL EUROPEAN SUBSET 2 (collection 282)
9. • JIS2004 IDEOGRAPHICS EXTENSION (collection 371) has 3695 code points.
• BASIC JAPANESE (collection 285) contains 6884 code points.
• IICORE (collection 370) has 9810 code points.
• Ranges are not very useful since code points in CJK collections are scattered.
10. • Some collections defined in Annex A contain unassigned code points.
• Unassigned code points may be assigned by later versions of ISO/IEC 10646.
• So, validation should provide "yes", "no", or "I don't know".
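The three possible answers can be sketched in a few lines of Python. This is only an illustration, not the CREPDL validator: the collection `EXAMPLE` is a hypothetical range chosen for the demonstration, and "unassigned" is decided by the Unicode version bundled with the Python interpreter rather than by any particular edition of ISO/IEC 10646.

```python
import unicodedata
from enum import Enum


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "I don't know"


def validate(ch: str, collection: range) -> Verdict:
    """Three-valued membership test for a single code point."""
    if ord(ch) not in collection:
        return Verdict.NO
    # General category "Cn" means "not assigned" in the Unicode data
    # shipped with this Python; a later version may assign the code point.
    if unicodedata.category(ch) == "Cn":
        return Verdict.UNKNOWN
    return Verdict.YES


# Hypothetical collection (the range 0370-03FF), used only for illustration.
EXAMPLE = range(0x0370, 0x0400)

for ch in ("\u03b1", "A", "\u0378"):
    print(f"U+{ord(ch):04X}", validate(ch, EXAMPLE).value)
```

The answer for a code point inside the range depends on the Unicode version in use, which is exactly why a plain yes/no result is not enough.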
11. • Some collections are defined as the union of other collections.
• MODERN EUROPEAN SCRIPTS (collection 283) is the union of more than 30 collections, each of which is a simple range.
• COMMON JAPANESE (collection 287) is defined as the union of BASIC JAPANESE (collection 285) and an enumerated list of 609 code points.
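Defining a subset as a union of other subsets amounts to resolving references by name. The sketch below shows the idea with a small in-memory registry; the collection names and contents (`RANGE-A`, `RANGE-B`, `EXTRA`, `COMBINED`) are made up for the example and are not collections from Annex A.

```python
from typing import Dict, FrozenSet, List, Tuple, Union

# A definition is either an explicit set of code points or a list of
# names of other collections whose union it denotes.
Definition = Union[FrozenSet[int], List[str]]


def resolve(name: str, registry: Dict[str, Definition],
            _seen: Tuple[str, ...] = ()) -> FrozenSet[int]:
    """Expand a collection name into the set of code points it denotes."""
    if name in _seen:
        raise ValueError(f"cyclic reference through {name!r}")
    definition = registry[name]
    if isinstance(definition, frozenset):
        return definition
    result = set()
    for part in definition:
        result |= resolve(part, registry, _seen + (name,))
    return frozenset(result)


# Hypothetical registry: two simple ranges, an enumerated list, and a union.
REGISTRY: Dict[str, Definition] = {
    "RANGE-A": frozenset(range(0x0041, 0x005B)),
    "RANGE-B": frozenset(range(0x0061, 0x007B)),
    "EXTRA": frozenset({0x00E9, 0x00F1}),
    "COMBINED": ["RANGE-A", "RANGE-B", "EXTRA"],
}

combined = resolve("COMBINED", REGISTRY)
```

Because `COMBINED` only names its parts, a change to `EXTRA` is picked up automatically, with no copied list to keep in sync.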
12. • A grapheme cluster is a sequence of code points that represents a "user-perceived character".
• 'e' followed by an accent character
• Japan now has tons of grapheme clusters.
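The 'e' plus accent case can be made concrete in Python. The grouping function below is a deliberate simplification that attaches every mark (general category M*) to the preceding character; it is not the full grapheme cluster algorithm of Unicode Standard Annex #29, and `simple_clusters` is a name invented for this sketch.

```python
import unicodedata
from typing import List


def simple_clusters(text: str) -> List[str]:
    """Group each base character with the marks that follow it.

    Simplified: only general categories Mn, Mc and Me are treated as
    extending the previous character (no ZWJ, Hangul or emoji rules).
    """
    clusters: List[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


decomposed = "e\u0301"           # 'e' + COMBINING ACUTE ACCENT
ideograph_vs = "\u845b\U000E0100"  # ideograph + VARIATION SELECTOR-17

print(len(decomposed), simple_clusters(decomposed))
print(len(ideograph_vs), simple_clusters(ideograph_vs))
```

In both cases two code points form one cluster, so a subset description that only lists code points cannot say whether the pair itself is allowed.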
14. • A collection applicable to persons' names in Japanese public service.
• The number of code points is more than 52000 and that of grapheme clusters is more than 10000.
15. • Kyouiku Kanji
  • elementary school education
  • 1006 characters
• Jouyou Kanji
  • use in official government documents
  • 2136 characters
• A list of such subsets from Asian governments is available at https://github.com/cjkvi/cjkvi-tables
16. • Based on Adobe-Japan1, JIS standards, 10646 collections, and so forth.
• But they tend to add several characters for commercial reasons.
• Font vendors in CITPC (Japanese Character Information Technology Promotion Council) are searching for machine-readable notations for describing font coverage.
17. • Unicode regular expressions can be used for representing subsets.
• The Unicode Common Locale Data Repository uses regular expressions for defining subsets.
• 10646 collections (even CJK collections) can be captured by Unicode regular expressions in theory.
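As a small illustration of a regular expression standing in for a subset, the sketch below validates text against the range 00A0-00FF (the LATIN-1 SUPPLEMENT range quoted earlier) with Python's standard `re` module. Note that `re` does not support the `\p{...}` property syntax of Unicode regular expressions, so only an explicit code point range is used here; the function name `is_valid` is invented for the example.

```python
import re

# Character class for the range 00A0-00FF; "*" lets the pattern cover a
# whole string, and fullmatch() rejects any character outside the class.
LATIN_1_SUPPLEMENT = re.compile(r"[\u00A0-\u00FF]*")


def is_valid(text: str) -> bool:
    """Return True when every character of `text` is in the subset."""
    return LATIN_1_SUPPLEMENT.fullmatch(text) is not None


print(is_valid("\u00e9\u00f1"))  # both characters are inside the range
print(is_valid("e"))             # U+0065 is outside the range
```

A range like this is compact; a scattered CJK collection would need thousands of individual code points inside the brackets.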
18. • Cannot reference collections defined in ISO/IEC 10646.
• Cannot reference other regular expressions.
• Copying is acceptable for small collections, but it is not acceptable for huge collections.
• COMMON JAPANESE (collection 287) references JIS2004 IDEOGRAPHICS EXTENSION (collection 371), which contains 3695 code points.
19. • Regular expression engines are slow.
• Hash-based set operations are much faster.
• 20 times faster for the MULTILINGUAL EUROPEAN SUBSET 2 collection.
• 1600 times faster for the IICORE collection.
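The two representations being compared can be sketched as follows. The subset is synthetic (every third code point from 4E00 to 9FFF, chosen only to be large and scattered) and is not a collection from ISO/IEC 10646. The sketch shows that a hash-based set and a regular expression character class give the same answers; it does not attempt to reproduce the speed-ups above, which were measured against the ICU engine and will differ in other engines.

```python
import re

# Synthetic scattered subset, for illustration only.
CODE_POINTS = range(0x4E00, 0xA000, 3)

# Hash-based representation: a frozenset of one-character strings.
AS_SET = frozenset(map(chr, CODE_POINTS))

# Regular expression representation: one large character class.
AS_REGEX = re.compile("[" + "".join(map(chr, CODE_POINTS)) + "]*")


def valid_by_set(text: str) -> bool:
    """Each character costs one hash lookup."""
    return AS_SET.issuperset(text)


def valid_by_regex(text: str) -> bool:
    """The regular expression engine scans its character class instead."""
    return AS_REGEX.fullmatch(text) is not None


inside = chr(0x4E00) + chr(0x4E03)
outside = chr(0x4E00) + chr(0x4E01)
```

With the set, the cost per character does not grow with the size of the collection, which is the property the hash-based approach relies on.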
20. • Interesting but never implemented.
• Its own syntax (rather than regular expressions) for representing ranges and code points.
• Kernel and hull elements for defining open collections.
• References to other subset descriptions or well-known subsets (e.g., collections in ISO/IEC 10646).
• Set operations (union, inverse, difference, and intersection).
• No mechanisms for describing grapheme clusters.
21. • Unicode regular expressions as atomic expressions.
  • <code>[abc]</code>
• References to collections defined in ISO/IEC 10646.
  • <repertoire registry="10646" number="370"/>
  • Typically implemented by hash-based sets.
• References to well-known subsets.
  • <ref href="URI-of-another-CREPDL-script"/>
• Set operations by the union, intersection, and difference elements.
• kernel and hull
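One way to picture how kernel and hull interact with the set operations is to model a repertoire as a pair of sets: the kernel (code points that are definitely in) and the hull (code points that may be in). The Python sketch below is built on that assumption. The class `Repertoire`, the helper `closed`, and the rules used for each operation are illustrative choices, not the semantics defined in ISO/IEC 19757-7.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Repertoire:
    kernel: FrozenSet[int]  # definitely included
    hull: FrozenSet[int]    # possibly included (must contain the kernel)

    def __post_init__(self) -> None:
        if not self.kernel <= self.hull:
            raise ValueError("kernel must be a subset of hull")

    def union(self, other: "Repertoire") -> "Repertoire":
        return Repertoire(self.kernel | other.kernel, self.hull | other.hull)

    def intersection(self, other: "Repertoire") -> "Repertoire":
        return Repertoire(self.kernel & other.kernel, self.hull & other.hull)

    def difference(self, other: "Repertoire") -> "Repertoire":
        # Definitely in: in our kernel and certainly not in the other.
        # Possibly in: in our hull and not definitely in the other.
        return Repertoire(self.kernel - other.hull, self.hull - other.kernel)

    def check(self, code_point: int) -> str:
        if code_point in self.kernel:
            return "yes"
        if code_point not in self.hull:
            return "no"
        return "unknown"


def closed(code_points: Iterable[int]) -> Repertoire:
    """A repertoire whose kernel and hull coincide (no 'unknown' answers)."""
    members = frozenset(code_points)
    return Repertoire(members, members)


open_set = Repertoire(frozenset({1, 2}), frozenset({1, 2, 3}))
fixed = closed({2, 3})
```

Under these rules, `open_set.check(3)` is "unknown", while `open_set.union(fixed).check(3)` becomes "yes" because the closed repertoire settles the question.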
22. • An open source implementation of CREPDL is available at https://github.com/CITPCSHARE/CREPDL.
  • Written in F# (a functional programming language)
  • Uses the ICU regular expression engine
  • Large collections in Annex A of ISO/IEC 10646 are implemented as hash-based sets. Validation against such collections is thus very efficient.
• Another GitHub repository with example CREPDL scripts is available at https://github.com/CITPCSHARE/CREPDLScripts.
23. • Create the DIS of ISO/IEC 19757-7 CREPDL and start a ballot.
• Sell CREPDL to font vendors in the Japanese Character Information Technology Promotion Council, of which I am a board member.
• Compare coverage of fonts automatically by comparing CREPDL scripts.
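Once two coverage descriptions have been resolved to sets of code points, the comparison itself reduces to set differences. The sketch below assumes that resolution has already happened; `compare_coverage` and the two tiny font coverages are hypothetical and stand in for real data.

```python
from typing import Dict, FrozenSet


def compare_coverage(a: FrozenSet[int],
                     b: FrozenSet[int]) -> Dict[str, FrozenSet[int]]:
    """Report what each coverage has that the other lacks."""
    return {
        "only_a": a - b,
        "only_b": b - a,
        "common": a & b,
    }


# Hypothetical coverages of two fonts, for illustration only.
font_a = frozenset({0x4E00, 0x4E01, 0x4E03})
font_b = frozenset({0x4E00, 0x4E03, 0x4E07})

report = compare_coverage(font_a, font_b)
for label, code_points in report.items():
    print(label, [f"U+{cp:04X}" for cp in sorted(code_points)])
```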
Editor's Notes
#2: I am going to talk about CREPDL, an XML language for describing subsets of Unicode or 10646. When we have to handle huge subsets, my CREPDL validator is more than 1000 times faster than the ICU Unicode regular expression engine.
#3: Let's begin with an exam. I used Google this morning to translate "usual" into many languages. ... PowerPoint can display all of them. Word can. Emacs can. But my favorite XML editor, Oxygen, cannot. Why?
#4: How many characters does Unicode have? Now, the latest version is 11. It has so many ...
#5: Do you believe that all CJK ideographic characters are needed? The short answer is no.
#6: So, we have too many characters. Does Unicode mandate the support of all characters?
#7: Then, how about 10646? There are interesting and significant differences.
#8: So, let's have a quick look at collections in 10646. Simple ones are very simple.
#9: This is more complicated, but is still not extremely complicated.
#12: We very often want to define subsets in terms of other subsets.
#13: I have used the word "character", but what users think is a "character" is not necessarily a single code point in Unicode.
#14: Let's see the first collection containing grapheme clusters.
#15: But a CJK collection containing grapheme clusters is much, much bigger.
#16: I have talked about collections, which are subsets defined by 10646. But there are many other subsets.
#17: An important type of subsets is font coverage. Each font implicitly defines a subset.
#18: Every Westerner says we only have to use Unicode regular expressions.
#19: There are two significant problems. They are not problematic for small collections, but are, in my opinion, fatal for large collections. The first problem is the inability to reference other subsets.