Makoto Murata
eb2mmrt@gmail.com
Keio University and JEPA

• usual
• įprasta
• демейдеги
• bình thường
• обычный
• 通常の

• 137,374 characters
• 87,887 CJK Unified Ideographs

• Mistakenly introduced characters
• A separation that is good enough for most people is not good enough for some (see the CJK compatibility ideographs).
• Japanese users do not necessarily need what Chinese users need, and vice versa.

• Implementations may support subsets.
  - No subsets are defined.
  - No mechanisms for describing subsets are defined.
• However, Unicode regular expressions can be used for representing subsets.

• Implementations may support subsets.
• Taxonomy of subsets
  - implementation-defined lists of code points,
  - standardized collections as defined in Annex A,
  - a combination of the two.
• Annex A uses multiple notations without formal definitions.

• LATIN-1 SUPPLEMENT (collection 2) is the range 00A0-00FF.
• MULTILINGUAL EUROPEAN SUBSET 2 (collection 282)

Plane 00
Row  Values within row
00   20-7E A0-FF
01   00-7F 8F 92 B7 DE-EF FA-FF
02   18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D8-DD EE
03   74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D7 DA-E1
04   00-5F 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
1E   02-03 0A-0B 1E-1F 40-41 56-57 60-61 6A-6B 80-85 9B F2-F3
1F   00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 F6-FE
20   13-15 17-1E 20-22 26 30 32-33 39-3A 3C 3E 44 4A 7F 82 A3-A4 A7 AC AF
21   05 16 22 26 5B-5E 90-95 A8
22   00 02-03 06 08-09 0F 11-12 19-1A 1E-1F 27-2B 48 59 60-61 64-65 82-83 95 97
23   02 10 20-21 29-2A
25   00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 AC B2 BA BC C4 CA-CB D8-D9
26   3A-3C 40 42 60 63 65-66 6A-6B
FB   01-02
FF   FD
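
The row/value notation in the table above can be expanded mechanically into a set of code points. The following is a minimal Python sketch, not taken from the standard or from any CREPDL implementation; the function name `parse_row` is hypothetical, and it assumes each line holds a two-digit hexadecimal row followed by single cells or `lo-hi` cell ranges.

```python
def parse_row(plane: int, line: str) -> set[int]:
    """Expand one 'ROW VALUES...' line into a set of code points."""
    row, *values = line.split()
    # A code point is plane (bits 16+), row (bits 8-15), and cell (bits 0-7).
    base = (plane << 16) | (int(row, 16) << 8)
    code_points: set[int] = set()
    for value in values:
        first, _, last = value.partition("-")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo  # a single cell is a one-element range
        code_points.update(base + cell for cell in range(lo, hi + 1))
    return code_points


# Three rows copied from the table above (the other rows are omitted here).
rows = ["00 20-7E A0-FF", "FB 01-02", "FF FD"]
subset = set().union(*(parse_row(0x00, r) for r in rows))

print(len(subset))        # 194
print(0x00E9 in subset)   # True: U+00E9 falls in row 00, cells A0-FF
print(0x0100 in subset)   # False: row 01 was not loaded in this example
```

Once expanded, membership tests are plain set lookups, which is the representation the later slides on performance rely on.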

• JIS2004 IDEOGRAPHICS EXTENSION (collection 371) has 3695 code points.
• BASIC JAPANESE (collection 285) contains 6884 code points.
• IICORE (collection 370) has 9810 code points.
• Ranges are not very useful since code points in CJK collections are scattered.

• Some collections defined in Annex A contain unassigned code points.
• Unassigned code points may be assigned by later versions of ISO/IEC 10646.
• So validation should answer “yes”, “no”, or “I don’t know”.
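
A three-valued check of this kind might look like the sketch below. It is only an illustration under stated assumptions: the range `ILLUSTRATIVE_RANGE` is a hypothetical stand-in for a collection (it is not an Annex A definition), and "unassigned" is taken from the Unicode data bundled with the running Python, which may be older or newer than any given edition of ISO/IEC 10646.

```python
import unicodedata
from enum import Enum


class Verdict(Enum):
    YES = "yes"
    NO = "no"
    UNKNOWN = "I don't know"


def validate(ch: str, collection: range) -> Verdict:
    """Classify one character against a collection given as a code point range."""
    if ord(ch) not in collection:
        return Verdict.NO
    # Category "Cn" means the code point is unassigned in this Unicode version;
    # a later version might assign it, so the answer cannot be a firm yes or no.
    if unicodedata.category(ch) == "Cn":
        return Verdict.UNKNOWN
    return Verdict.YES


# Hypothetical collection used only for illustration.
ILLUSTRATIVE_RANGE = range(0x0370, 0x0400)

for ch in ["\u03b1", "a", "\u03a2"]:
    print(f"U+{ord(ch):04X}", validate(ch, ILLUSTRATIVE_RANGE).value)
```

U+03B1 is an assigned character inside the range, "a" lies outside it, and U+03A2 is inside the range but unassigned, so the three inputs exercise the three possible answers.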

• Some collections are defined as the union of other collections.
  - MODERN EUROPEAN SCRIPTS (collection 283) is the union of more than 30 collections, each of which is a simple range.
  - COMMON JAPANESE (collection 287) is defined as the union of BASIC JAPANESE (collection 285) and an enumerated list of 609 code points.

• A grapheme cluster is a sequence of code points that represents a “user-perceived character”.
  - Example: ‘e’ followed by a combining accent character
• Japan now has a great many grapheme clusters.
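
The gap between code points and user-perceived characters can be seen with a few lines of Python. This sketch deliberately uses a simplified rule (attach every combining mark to the preceding character); it is an assumption made for illustration and is not the full grapheme cluster segmentation algorithm defined by Unicode. The function name `simple_clusters` is hypothetical.

```python
import unicodedata


def simple_clusters(text: str) -> list[str]:
    """Group each base character with the combining marks that follow it."""
    clusters: list[str] = []
    for ch in text:
        # unicodedata.combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


word = "cafe\u0301"                  # 'e' followed by COMBINING ACUTE ACCENT
print(len(word))                     # 5 code points
print(len(simple_clusters(word)))    # 4 user-perceived characters
```

A subset description that only lists code points cannot say whether the pair 'e' plus U+0301 is acceptable as a unit, which is why sequences need their own notation.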

• Plane 00
  00   41-50 52-56 59-5A 61-70 72-76 79-7A C0-C1 C3 C8-C9 CC-CD D1-D3 D5 D9-DA DD E0-E1 E3 E8-E9 F1-F3 F5 F9-FA FD
  01   04-05 0C-0D 16-19 28 2E-2F 60-61 68-6B 72-73 7D-7E
  1E   BC-BD F8-F9
• UCS Sequence Identifiers
  <0104, 0301> <0105, 0301> <0104, 0303> <0105, 0303> <0118, 0301>
  <0119, 0301> <0118, 0303> <0119, 0303> <0116, 0301> <0117, 0301>
  <0116, 0303> <0117, 0303> <0069, 0307, 0300> <0069, 0307, 0301>
  <0069, 0307, 0303> <012E, 0301> <012F, 0307, 0301> <012E, 0303>
  <012F, 0307, 0303> <004A, 0303> <006A, 0307, 0303> <004C, 0303>
  <006C, 0303> <004D, 0303> <006D, 0303> <0052, 0303> <0072, 0303>
  <0172, 0301> <0173, 0301> <0172, 0303> <0173, 0303> <016A, 0301>
  <016B, 0301> <016A, 0303> <016B, 0303>
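
The sequence identifiers above are easy to turn into strings that can be stored in a set alongside single code points. The following Python sketch assumes the `<XXXX, YYYY, ...>` notation shown in the list, with hexadecimal code points separated by commas; `parse_usi` is a hypothetical helper name, not part of any standard tooling.

```python
import re


def parse_usi(text: str) -> list[str]:
    """Convert '<XXXX, YYYY>' sequence identifiers into Python strings."""
    sequences = []
    for body in re.findall(r"<([^<>]+)>", text):
        code_points = [int(part.strip(), 16) for part in body.split(",")]
        sequences.append("".join(chr(cp) for cp in code_points))
    return sequences


# Three identifiers copied from the list above.
sample = "<0104, 0301> <0069, 0307, 0300> <0117, 0301>"
allowed_sequences = set(parse_usi(sample))

print([len(s) for s in parse_usi(sample)])      # [2, 3, 2]
print("\u0104\u0301" in allowed_sequences)      # True
print("\u0104\u0300" in allowed_sequences)      # False: this pair is not listed
```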

• A collection applicable to persons' names in Japanese public services.
• It contains more than 52000 code points and more than 10000 grapheme clusters.

• Kyouiku Kanji
  - elementary school education
  - 1006 characters
• Jouyou Kanji
  - use in official government documents
  - 2136 characters
• A list of such subsets from Asian governments is available at https://github.com/cjkvi/cjkvi-tables

• Based on Adobe-Japan1, JIS standards, 10646 collections, and so forth.
• But font vendors tend to add several characters for commercial reasons.
• Font vendors in CITPC (Japanese Character Information Technology Promotion Council) are searching for machine-readable notations for describing font coverage.

• Unicode regular expressions can be used for representing subsets.
• The Unicode Common Locale Data Repository (CLDR) uses regular expressions for defining subsets.
• 10646 collections (even CJK collections) can in principle be captured by Unicode regular expressions.
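
As a small illustration of a regular expression acting as a subset description, the sketch below encodes LATIN-1 SUPPLEMENT (the range 00A0-00FF quoted earlier) as a character class. It uses Python's standard `re` module, which does not support the `\p{...}` property syntax of full Unicode regular expressions, so only an explicit range is shown; the helper name `in_subset` is hypothetical.

```python
import re

# LATIN-1 SUPPLEMENT written as a character class over its code point range.
LATIN_1_SUPPLEMENT = re.compile(r"[\u00A0-\u00FF]*")


def in_subset(text: str) -> bool:
    """Return True when every character of text lies in the subset."""
    return LATIN_1_SUPPLEMENT.fullmatch(text) is not None


print(in_subset("\u00e9\u00fc"))   # True: both characters are in 00A0-00FF
print(in_subset("\u00e9\u20ac"))   # False: U+20AC is outside the range
```

A single range like this is compact, but a scattered CJK collection would need thousands of entries in the character class.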

• Cannot reference collections defined in ISO/IEC 10646.
• Cannot reference other regular expressions.
• Copying is acceptable for small collections, but it is not acceptable for huge collections.
  - COMMON JAPANESE (collection 287) references JIS2004 IDEOGRAPHICS EXTENSION (collection 371), which contains 3695 code points.

• Regular expression engines are slow.
• Hash-based set operations are much faster.
  - 20 times faster for the MULTILINGUAL EUROPEAN SUBSET 2 collection.
  - 1600 times faster for the IICORE collection.
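
The two approaches can be compared with a harness like the one below. This is a hedged sketch only: the code point list is a placeholder (every third code point in a CJK range), not a real Annex A collection, and the timings depend heavily on the regular expression engine, so this Python measurement is not expected to reproduce the ratios quoted above.

```python
import re
import timeit

# Placeholder for a large, scattered collection (NOT a real Annex A collection).
code_points = list(range(0x4E00, 0x9FA6, 3))

as_set = frozenset(chr(cp) for cp in code_points)
as_regex = re.compile("[" + "".join(chr(cp) for cp in code_points) + "]*")


def valid_by_set(text: str) -> bool:
    """Hash-based check: every character must be a member of the set."""
    return all(ch in as_set for ch in text)


def valid_by_regex(text: str) -> bool:
    """Regex-based check: the whole text must match the character class."""
    return as_regex.fullmatch(text) is not None


good = "".join(chr(cp) for cp in code_points[:1000])
bad = good + "a"

# Both representations must give the same answers.
print(valid_by_set(good), valid_by_regex(good))
print(valid_by_set(bad), valid_by_regex(bad))

# Timings are printed without expected values because they vary by machine.
print("set  :", timeit.timeit(lambda: valid_by_set(good), number=100))
print("regex:", timeit.timeit(lambda: valid_by_regex(good), number=100))
```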

• Interesting but never implemented.
• Its own syntax (rather than regular expressions) for representing ranges and code points.
• Kernel and hull elements for defining open collections.
• References to other subset descriptions or well-known subsets (e.g., collections in ISO/IEC 10646).
• Set operations (union, inverse, difference, and intersection).
• No mechanisms for describing grapheme clusters.

• Unicode regular expressions as atomic expressions.
  - <code>[abc]</code>
• References to collections defined in ISO/IEC 10646.
  - <repertoire registry="10646" number="370"/>
  - Typically implemented by hash-based sets.
• References to well-known subsets.
  - <ref href="URI-of-another-CREPDL-script"/>
• Set operations by the union, intersection, and difference elements.
• Kernel and hull elements.
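
How set operations and kernel/hull might combine under three-valued validation can be sketched as a tiny evaluator. Everything here is an assumption made for illustration: the function names are hypothetical, the three-valued rules follow ordinary Kleene-style logic, and the kernel/hull reading (in the kernel means yes, outside the hull means no, otherwise unknown) is one plausible interpretation rather than the normative semantics of ISO/IEC 19757-7.

```python
from typing import Callable

Verdict = str                      # "yes", "no", or "unknown"
Expr = Callable[[str], Verdict]    # an expression classifies one character

NEGATE = {"yes": "no", "no": "yes", "unknown": "unknown"}


def char_set(members: str) -> Expr:
    """Atomic expression backed by a hash-based set."""
    table = frozenset(members)
    return lambda ch: "yes" if ch in table else "no"


def union(*exprs: Expr) -> Expr:
    def check(ch: str) -> Verdict:
        results = [e(ch) for e in exprs]
        if "yes" in results:
            return "yes"
        return "unknown" if "unknown" in results else "no"
    return check


def intersection(*exprs: Expr) -> Expr:
    def check(ch: str) -> Verdict:
        results = [e(ch) for e in exprs]
        if "no" in results:
            return "no"
        return "unknown" if "unknown" in results else "yes"
    return check


def difference(a: Expr, b: Expr) -> Expr:
    # a minus b is treated as a AND (NOT b).
    return intersection(a, lambda ch: NEGATE[b(ch)])


def kernel_hull(kernel: Expr, hull: Expr) -> Expr:
    def check(ch: str) -> Verdict:
        if kernel(ch) == "yes":
            return "yes"
        if hull(ch) == "no":
            return "no"
        return "unknown"
    return check


vowels = char_set("aeiou")
letters = char_set("abcdefghijklmnopqrstuvwxyz")
consonants = difference(letters, vowels)
open_collection = kernel_hull(kernel=vowels, hull=letters)

print(consonants("b"), consonants("a"))                                   # yes no
print(open_collection("a"), open_collection("b"), open_collection("1"))   # yes unknown no
```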

• An open source implementation of CREPDL is available at https://github.com/CITPCSHARE/CREPDL.
  - Written in F# (a functional programming language)
  - Uses the ICU regular expression engine
  - Large collections in Annex A of ISO/IEC 10646 are implemented as hash-based sets. Validation against such collections is thus very efficient.
• Another GitHub repository with example CREPDL scripts is available at https://github.com/CITPCSHARE/CREPDLScripts.

• Create the DIS of ISO/IEC 19757-7 CREPDL and start a ballot.
• Sell CREPDL to font vendors in the Japanese Character Information Technology Promotion Council, of which I am a board member.
• Compare coverage of fonts automatically by comparing CREPDL scripts.

CREPDL: Protect Yourself from the Proliferation of Unicode Characters
