際際滷

際際滷Share a Scribd company logo
Makoto Murata
eb2mmrt@gmail.com
Keio University and JEPA
? usual
? ????????
? ???????
? ???????
? ?????????????
? ???????
? ?prasta
? ????????
? ??????????
? ????
? демейдеги
? ???
? b━nh th??ng
? обычный
? 宥械の
? 137,374 characters
? 87,887 CJK Unified Ideographs
? Mistakenly introduced characters
? Separation good enough for most people is not good enough
for somebody (see CJK compatibility ideographics).
? Japanese do not necessarily need what Chinese need, and
vice versa.
? Implementations may support subsets.
? No subsets are defined.
? No mechanisms for describing subsets are defined.
? However, it is true that Unicode regular expressions can be
used for representing subsets.
? Implementations may support subsets.
? Taxonomy of subsets
? implementation-defined lists of code points,
? standardized collections as defined in Annex A
? combination of the two.
? Annex A uses multiple notations without formal definitions.
? LATIN-1 SUPPLEMENT (collection 2) is a range 00A0-00FF.
? MULTILINGUAL EUROPEAN SUBSET 2 (collection 282)
Plane 00
Row Values within row
? 00 20-7E A0-FF
? 01 00-7F 8F 92 B7 DE-EF FA-FF
? 02 18-1B 1E-1F 59 7C 92 BB-BD
? C6-C7 C9 D8-DD EE
? 03 74-75 7A 7E 84-8A 8C 8E-A1
? A3-CE D7 DA-E1
? 04 00-5F 90-C4 C7-C8 CB-CC D0-EB
? EE-F5 F8-F9
? 1E 02-03 0A-0B 1E-1F 40-41 56-57
? 60-61 6A-6B 80-85 9B F2-F3
? 1F 00-15 18-1D 20-45 48-4D 50-57
? 59 5B 5D 5F-7D 80-B4 B6-C4
? C6-D3 D6-DB DD-EF F2-F4 F6-FE
? 20 13-15 17-1E 20-22 26 30 32-33
? 39-3A 3C 3E 44 4A 7F 82 A3-A4
? A7 AC AF
? 21 05 16 22 26 5B-5E 90-95 A8
? 22 00 02-03 06 08-09 0F 11-12 19-1A
? 1E-1F 27-2B 48 59 60-61 64-65
? 82-83 95 97
? 23 02 10 20-21 29-2A
? 25 00 02 0C 10 14 18 1C 24 2C 34 3C
? 50-6C 80 84 88 8C 90-93
? A0 AC B2 BA BC C4 CA-CB D8-D9
? 26 3A-3C 40 42 60 63 65-66 6A-6B
? FB 01-02
? FF FD
? JIS2004 IDEOGRAPHICS EXTENSION (collection 371) has 3695
code points.
? BASIC JAPANESE (collection 285) contains 6884 code points.
? IICORE (collection 370) has 9810 code points.
? Ranges are not very useful since code points in CJK
collections are scattered.
? Some collections defined in Annex A contain unassigned code
points.
? Unassigned code points may be assigned by later versions of
ISO/IEC 10646.
? So, validation should provide ^yes ̄, ^no ̄, or ^I don¨t know ̄.
? Some collections are defined as the union of other collections.
? MODERN EUROPEAN SCRIPTS (collection 283) is the union of more
than 30 collections, each of which is a simple range.
? COMMON JAPANESE (collection 287) is defined as the union of BASIC
JAPANESE (collection 285) and an enumerated list of 609 code
points.
? A grapheme cluster is a sequence of code points that
represents ^user-perceived characters ̄.
? `e¨ followed by an accent character
? Japan now has tons of grapheme clusters.
? Plane 00
? 00 41-50 52-56 59-5A 61-70 72-76 79-7A C0-C1 C3 C8-C9 CC-CD D1-D3
D5 D9-DA DD E0-E1 E3 E8-E9 F1-F3 F5 F9-FA FD
? 01 04-05 0C-0D 16-19 28 2E-2F 60-61 68-6B 72-73 7D-7E
? 1E BC-BD F8-F9
? UCS Sequence Identifiers
? <0104, 0301> <0105, 0301> <0104, 0303> <0105, 0303> <0118, 0301>
<0119, 0301> <0118, 0303> <0119, 0303> <0116, 0301> <0117,0301>
<0116, 0303> <0117, 0303> <0069, 0307, 0300> <0069, 0307, 0301>
<0069, 0307, 0303> <012E, 0301> <012F, 0307, 0301>, <012E, 0303>
<012F, 0307, 0303> <004A, 0303> <006A, 0307, 0303> <004C, 0303>
<006C, 0303> <004D, 0303> <006D, 0303> <0052, 0303> <0072, 0303>
<0172, 0301> <0173, 0301> <0172, 0303> <0173, 0303> <016A, 0301>
<016B, 0301> <016A, 0303> <016B, 0303>
? a collection applicable to persons' names in Japanese public
service.
? The number of code points is more than 52000 and that of
grapheme clusters is more than 10000.
? Kyouiku Kanji
? elementary school education
? 1006 characters.
? Jouyou Kanji
? use in official government documents
? 2136 characters
? A list of such subsets from Asian governments is available at
https://github.com/cjkvi/cjkvi-tables
? Based on Adobe-Japan1, JIS standards, 10646 collections and
so forth.
? But they tend to add several characters for some commercial
reasons.
? Font vendors in CITPC (Japanese Character Information
Technology Promotion Council) are searching for machine-
readable notations for describing font coverage.
? Unicode regular expressions can be used for representing
subsets.
? Unicode Common Locale Data Repository use regular expressions
for defining subsets.
? 10646 collections (even CJK collections) can be captured by
Unicode regular expressions in theory.
? Cannot reference collections defined in ISO/IEC 10646.
? Cannot reference other regular expressions.
? Copying is acceptable for small collections, but it not
acceptable for huge collections.
? COMMON JAPANESE (collection 287) references JIS2004
IDEOGRAPHICS EXTENSION (collection 371), which contains 3695
code points.
? Regular expression engines are slow.
? Hash-based set operations are much faster.
? 20 times faster for MULTILINGUAL EUROPEAN SUBSET 2 collection.
? 1600 times faster for the IICORE collection.
? Interesting but never implemented.
? Its own syntax (rather than regular expressions) for
representing ranges and code points, respectively.
? Kernel and hull elements for defining open collections.
? References to other subset descriptions or well-known
subsets (e.g., collections in ISO/IEC 10646)
? Set operations (union, inverse, difference, and intersection).
? No mechanisms for describing grapheme clusters.
? Unicode regular expressions as atomic expressions.
? <code>[abc]</code>
? References to collections defined in ISO/IEC 10646.
? <repertoire registry="10646" number="370"/>
? Typically implemented by hash-based sets.
? References to well-known subsets.
? <ref href= ̄URI-of-another-CREPDL-script ̄/>
? Set operation by the union, intersection, and difference
elements.
? kernel and hull
? An open source implementation of CREPDL is available at
https://github.com/CITPCSHARE/CREPDL.
? Written in F# (a functional programming language)
? Uses the ICU regular expression engine
? Large collections in Annex A of ISO/IEC 10646 are implemented as
hash-based sets. Validation against such collections is thus very
efficient.
? Another GitHub for example CREPDL scripts is available at
https://github.com/CITPCSHARE/CREPDLScripts.
? Create the DIS of ISO/IEC 19757-7 CREPDL and start a ballot.
? Sell CREPDL to font vendors in the Japanese Character
Information Technology Promotion Council, of which I am a
board member.
? Compare coverage of fonts automatically by comparing
CREPDL scripts.

More Related Content

CREPDL: Protect Yourself from the Proliferation of Unicode Characters

  • 2. ? usual ? ???????? ? ??????? ? ??????? ? ????????????? ? ??????? ? ?prasta ? ???????? ? ?????????? ? ???? ? демейдеги ? ??? ? b━nh th??ng ? обычный ? 宥械の
  • 3. ? 137,374 characters ? 87,887 CJK Unified Ideographs
  • 4. ? Mistakenly introduced characters ? Separation good enough for most people is not good enough for somebody (see CJK compatibility ideographics). ? Japanese do not necessarily need what Chinese need, and vice versa.
  • 5. ? Implementations may support subsets. ? No subsets are defined. ? No mechanisms for describing subsets are defined. ? However, it is true that Unicode regular expressions can be used for representing subsets.
  • 6. ? Implementations may support subsets. ? Taxonomy of subsets ? implementation-defined lists of code points, ? standardized collections as defined in Annex A ? combination of the two. ? Annex A uses multiple notations without formal definitions.
  • 7. ? LATIN-1 SUPPLEMENT (collection 2) is a range 00A0-00FF. ? MULTILINGUAL EUROPEAN SUBSET 2 (collection 282)
  • 8. Plane 00 Row Values within row ? 00 20-7E A0-FF ? 01 00-7F 8F 92 B7 DE-EF FA-FF ? 02 18-1B 1E-1F 59 7C 92 BB-BD ? C6-C7 C9 D8-DD EE ? 03 74-75 7A 7E 84-8A 8C 8E-A1 ? A3-CE D7 DA-E1 ? 04 00-5F 90-C4 C7-C8 CB-CC D0-EB ? EE-F5 F8-F9 ? 1E 02-03 0A-0B 1E-1F 40-41 56-57 ? 60-61 6A-6B 80-85 9B F2-F3 ? 1F 00-15 18-1D 20-45 48-4D 50-57 ? 59 5B 5D 5F-7D 80-B4 B6-C4 ? C6-D3 D6-DB DD-EF F2-F4 F6-FE ? 20 13-15 17-1E 20-22 26 30 32-33 ? 39-3A 3C 3E 44 4A 7F 82 A3-A4 ? A7 AC AF ? 21 05 16 22 26 5B-5E 90-95 A8 ? 22 00 02-03 06 08-09 0F 11-12 19-1A ? 1E-1F 27-2B 48 59 60-61 64-65 ? 82-83 95 97 ? 23 02 10 20-21 29-2A ? 25 00 02 0C 10 14 18 1C 24 2C 34 3C ? 50-6C 80 84 88 8C 90-93 ? A0 AC B2 BA BC C4 CA-CB D8-D9 ? 26 3A-3C 40 42 60 63 65-66 6A-6B ? FB 01-02 ? FF FD
  • 9. ? JIS2004 IDEOGRAPHICS EXTENSION (collection 371) has 3695 code points. ? BASIC JAPANESE (collection 285) contains 6884 code points. ? IICORE (collection 370) has 9810 code points. ? Ranges are not very useful since code points in CJK collections are scattered.
  • 10. ? Some collections defined in Annex A contain unassigned code points. ? Unassigned code points may be assigned by later versions of ISO/IEC 10646. ? So, validation should provide ^yes ̄, ^no ̄, or ^I don¨t know ̄.
  • 11. ? Some collections are defined as the union of other collections. ? MODERN EUROPEAN SCRIPTS (collection 283) is the union of more than 30 collections, each of which is a simple range. ? COMMON JAPANESE (collection 287) is defined as the union of BASIC JAPANESE (collection 285) and an enumerated list of 609 code points.
  • 12. ? A grapheme cluster is a sequence of code points that represents ^user-perceived characters ̄. ? `e¨ followed by an accent character ? Japan now has tons of grapheme clusters.
  • 13. ? Plane 00 ? 00 41-50 52-56 59-5A 61-70 72-76 79-7A C0-C1 C3 C8-C9 CC-CD D1-D3 D5 D9-DA DD E0-E1 E3 E8-E9 F1-F3 F5 F9-FA FD ? 01 04-05 0C-0D 16-19 28 2E-2F 60-61 68-6B 72-73 7D-7E ? 1E BC-BD F8-F9 ? UCS Sequence Identifiers ? <0104, 0301> <0105, 0301> <0104, 0303> <0105, 0303> <0118, 0301> <0119, 0301> <0118, 0303> <0119, 0303> <0116, 0301> <0117,0301> <0116, 0303> <0117, 0303> <0069, 0307, 0300> <0069, 0307, 0301> <0069, 0307, 0303> <012E, 0301> <012F, 0307, 0301>, <012E, 0303> <012F, 0307, 0303> <004A, 0303> <006A, 0307, 0303> <004C, 0303> <006C, 0303> <004D, 0303> <006D, 0303> <0052, 0303> <0072, 0303> <0172, 0301> <0173, 0301> <0172, 0303> <0173, 0303> <016A, 0301> <016B, 0301> <016A, 0303> <016B, 0303>
  • 14. ? a collection applicable to persons' names in Japanese public service. ? The number of code points is more than 52000 and that of grapheme clusters is more than 10000.
  • 15. ? Kyouiku Kanji ? elementary school education ? 1006 characters. ? Jouyou Kanji ? use in official government documents ? 2136 characters ? A list of such subsets from Asian governments is available at https://github.com/cjkvi/cjkvi-tables
  • 16. ? Based on Adobe-Japan1, JIS standards, 10646 collections and so forth. ? But they tend to add several characters for some commercial reasons. ? Font vendors in CITPC (Japanese Character Information Technology Promotion Council) are searching for machine- readable notations for describing font coverage.
  • 17. ? Unicode regular expressions can be used for representing subsets. ? Unicode Common Locale Data Repository use regular expressions for defining subsets. ? 10646 collections (even CJK collections) can be captured by Unicode regular expressions in theory.
  • 18. ? Cannot reference collections defined in ISO/IEC 10646. ? Cannot reference other regular expressions. ? Copying is acceptable for small collections, but it not acceptable for huge collections. ? COMMON JAPANESE (collection 287) references JIS2004 IDEOGRAPHICS EXTENSION (collection 371), which contains 3695 code points.
  • 19. ? Regular expression engines are slow. ? Hash-based set operations are much faster. ? 20 times faster for MULTILINGUAL EUROPEAN SUBSET 2 collection. ? 1600 times faster for the IICORE collection.
  • 20. ? Interesting but never implemented. ? Its own syntax (rather than regular expressions) for representing ranges and code points, respectively. ? Kernel and hull elements for defining open collections. ? References to other subset descriptions or well-known subsets (e.g., collections in ISO/IEC 10646) ? Set operations (union, inverse, difference, and intersection). ? No mechanisms for describing grapheme clusters.
  • 21. ? Unicode regular expressions as atomic expressions. ? <code>[abc]</code> ? References to collections defined in ISO/IEC 10646. ? <repertoire registry="10646" number="370"/> ? Typically implemented by hash-based sets. ? References to well-known subsets. ? <ref href= ̄URI-of-another-CREPDL-script ̄/> ? Set operation by the union, intersection, and difference elements. ? kernel and hull
  • 22. ? An open source implementation of CREPDL is available at https://github.com/CITPCSHARE/CREPDL. ? Written in F# (a functional programming language) ? Uses the ICU regular expression engine ? Large collections in Annex A of ISO/IEC 10646 are implemented as hash-based sets. Validation against such collections is thus very efficient. ? Another GitHub for example CREPDL scripts is available at https://github.com/CITPCSHARE/CREPDLScripts.
  • 23. ? Create the DIS of ISO/IEC 19757-7 CREPDL and start a ballot. ? Sell CREPDL to font vendors in the Japanese Character Information Technology Promotion Council, of which I am a board member. ? Compare coverage of fonts automatically by comparing CREPDL scripts.

Editor's Notes

  • #2: I am going to talk about CREPDL, an XML language for describing subsets of Unicode or 10646. When we have to handle huge subsets, my CREPDL validator is more than 1000 times faster than the ICU Unicode regular expression engine.
  • #3: Let¨s begin with an exam. I used Google this morning to translate ^usual ̄ to many languages. ´. PowerPoint can display all of them. Word can. Emacs can. But my favorite XML editor, Oxygen, cannot. Why?
  • #4: How many characters does Unicode have? Now, the latest version is 11. It has so many ´
  • #5: Do you believe that all CJK ideographic characters are needed? A short answer is No.
  • #6: So, we have too many characters. Does Unicode mandate the support of all characters?
  • #7: Then, how about 10646? There are interesting and significant differences.
  • #8: So, let¨s have a quick look at Collections in 10646. Simple ones are very simple.
  • #9: This is more complicated, but is still not extremely complicated.
  • #10: But CJK collections are much bigger.
  • #11: Here let me introduce open collections.
  • #12: We very often want to define subsets in terms of other subsets.
  • #13: I have used the word ^character ̄, but what users think is a ^character ̄ is not necessarily a single code point in Unicode.
  • #14: Let¨ see the first collection containing grapheme clusters.
  • #15: But a CJK collection containing grapheme clusters is much much bigger.
  • #16: I have talked about collections, which are subsets defined by 10646. But there are many other subsets.
  • #17: An important type of subsets is font coverage. Each font implicitly defines a subset.
  • #18: Every Westerner says we only have to use Unicode regular expressions.
  • #19: There are two significant problems. They are not problematic for small collections, but are, in my opinion, fatal for large collections. The first problems is inability to reference other subsets.
  • #20: The other problem is performance.
  • #21: An interesting alternative is ^A Notation ´. ̄. It has never been implemented, but we can still learn from it.
  • #22: CREPDL is my attempt to combine best parts of both Unicode regular expressions and the W3C notation.
  • #23: I have an implementation of CREPDL. It is open source. It is ´.
  • #24: CREPDL is an attempt to combine regular expressions and the W3C notation.