The document discusses the development of a thesaurus of classical Japanese poetic vocabulary. It outlines how the thesaurus was created by analyzing poems from the Hachidaishu anthologies using techniques like tokenization, meta-code conversion, and matching original poems to scholarly translations to extract vocabulary terms and their meanings over time. The goal is to better understand the connotation and historical transition of classical poetic words in a longitudinal study.
Tokyotech20130715
1. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 1
Development of the Thesaurus of Classical
Japanese Poetic Vocabulary
Hilofumi Yamamoto
Tokyo Institute of Technology
15th July 2013
2. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 2
Outline
1. Purpose of Study
• Connotation of classical poetic vocabulary
• Longitudinal study of transition of vocabulary
2. Development of Thesaurus
3. Applications
3. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 3
Waka: Japanese Poetry
Tatsuta-Hime..
tamukuru KAMI no / arebakoso
aki no konoha no / nusa to chirurame
because Princess Tatsuta
has a god to whom she offers brocades,
the leaves of trees
in autumn will scatter
as an offering.
Prince Kanemi
No. 298 in the Kokinshū
4. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 4
Problem: Orthography
in hiragana
たつた
in Chinese characters
立田
竜田
龍田
→ All Tatsuta (place name)
5. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 5
Problem: Unit size / attribution
The unit size and meaning of a word depend on context.
• unit: 卯の花 or 卯/の/花 (Nakano, 1998)
• orthography: さびしい, さみしい, 寂しい, 淋しい (sad)
• attributions: 卯の花 (plant) or 卯の花 (food)
(unohana = a deutzia or bean curd refuse)
6. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 6
An Item of Thesaurus: God
BG-01-2030-01-030-A-かみ-神
● ● ● ● ● ● ● ●
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 1: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
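The eight-part item structure in Figure 1 can be illustrated with a short Python sketch. The `MetaCode` type, its field names, and `parse_meta_code` are hypothetical names chosen here for illustration; the slides do not show how the BG database is actually read.

```python
from typing import NamedTuple


class MetaCode(NamedTuple):
    """One thesaurus item, following the eight parts listed in Figure 1."""
    database: str  # (1) database ID, e.g. BG
    pos: str       # (2) part of speech ID, e.g. 01 = noun
    group: str     # (3) group ID, e.g. 2030
    field: str     # (4) field ID
    exact: str     # (5) exact ID, e.g. 030 = god
    era: str       # (6) era-flag, A or C
    reading: str   # (7) reading of the Chinese character
    kanji: str     # (8) Chinese character


def parse_meta_code(item: str) -> MetaCode:
    """Split a hyphen-separated item into its eight parts (illustrative only)."""
    parts = item.split("-", 7)
    if len(parts) != 8:
        raise ValueError(f"expected 8 hyphen-separated parts, got {len(parts)}")
    return MetaCode(*parts)


kami = parse_meta_code("BG-01-2030-01-030-A-かみ-神")
print(kami.group, kami.exact, kami.era)
```

Keeping the parts as strings preserves leading zeros such as `01` and `030`, which carry meaning in the code.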
7. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 7
Development: Thesaurus, KH, and t2c
• Thesaurus for classical poetic vocabulary
• KH (tokenizer)
• t2c (token to code converter)
8. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 8
Materials: the Hachidaishū
• The Hachidaishū: eight anthologies compiled by imperial orders during ca. 905–1205.
• The database: compiled by the National Institute of Japanese Literature, Japan.
• Old texts are based on the Shōhobonban version of the Hachidaishū.
Timeline of the eight anthologies (axis 900–1250):
Anthology       Date       Years to next anthology
Kokinshū        ca. 905    46
Gosenshū        ca. 951    56
Shūishū         ca. 1007   79
Goshūishū       1086       38
Kinyōshū        ca. 1124   20
Shikashū        ca. 1144   44
Senzaishū       1188       17
Shinkokinshū    1205
9. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 9
Methods: Flowchart of data processing
A. Corpus development
B. Tokenisation
C. Meta-code conversion
D. Mathematical modelling
E. Subtraction: CT - OP
F. Visualisation
10. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 10
Development: Thesaurus, KH, and t2c
• Thesaurus for classical poetic vocabulary
• KH (tokenizer)
• t2c (token to code converter)
11. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 11
Table 1: An example of input for KH / Gosenshū No. 664
input: 000664 わすられて思ふなげきのしげるをや身をはづかしのもりといふらん
output:000664
わすら (ラ四-未:忘る:わする:忘ら:わすら)
れ (自可受-用:る:る:れ:れ)
て (接助:て:て)
思ふ (ハ四-終体:思ふ:おもふ:思ふ:おもふ)
なげき (カ四-用:嘆く:なげく:嘆き:なげき)
の (格助:の:の)
しげる (ラ四-終体:茂る:しげる:茂る:しげる)
を (間助:を:を)
や (係助:や:や)
身 (名:身:み)
を (間助:を:を)
---
はづかし (名-地名:羽束師:はづかし)
の (格助:の:の)
---
はづかし (形シク-終:恥づかし:はづかし:恥づかし:はづかし)
の (格助:の:の)
---
もり (名:森:もり)
と (格助-引用:と:と)
いふ (ハ四-終体:言ふ:いふ:言ふ:いふ)
らん (推-終体:らむ:らむ:らむ:らむ)
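A minimal sketch of how output lines shaped like those in Table 1 could be read into records. It assumes each token line has the form `surface (tag:field:field...)` and that lines of `---` are separators. The `Token` type and `parse_kh_line` are hypothetical helpers, not part of KH itself.

```python
import re
from typing import NamedTuple, Optional

# surface form, whitespace, then a parenthesised, colon-separated analysis
LINE = re.compile(r"^(?P<surface>\S+)\s*\((?P<body>[^()]*)\)$")


class Token(NamedTuple):
    surface: str
    tag: str        # grammatical tag, e.g. 名 or ラ四-未
    fields: tuple   # remaining colon-separated fields, kept as-is


def parse_kh_line(line: str) -> Optional[Token]:
    """Return a Token for an analysis line, or None for separators and headers."""
    match = LINE.match(line.strip())
    if match is None:
        return None
    tag, *rest = match.group("body").split(":")
    return Token(match.group("surface"), tag, tuple(rest))


sample = """わすら (ラ四-未:忘る:わする:忘ら:わすら)
れ (自可受-用:る:る:れ:れ)
---
もり (名:森:もり)"""

tokens = [t for t in map(parse_kh_line, sample.splitlines()) if t is not None]
for token in tokens:
    print(token.surface, token.tag, token.fields)
```

The sketch does not interpret the individual fields, since the slide does not label them. It only separates the tag from the rest.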
12. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 12
Development: Thesaurus
Poem texts (Hachidaishu) → kh (tokeniser) → t2c (thesaurus code tagger) → Thesaurus
Feedback steps, labelled (A) and (B) in the diagram:
• add unknown entries to the dictionary (General, Place Name, Personal Name, etc.)
• add new thesaurus codes to the thesaurus
13. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 13
(A) Corpus: Poems (OP)
KW00029800|A|KANEMI NO Ō=kanemi no ō
KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/★
tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]★
no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/★
aki[NOUN:AKI]no[CON],konoha[NOUN:KOnoHA]no[SUB]/★
nusa[NOUN:NUSA]to[P-CRD],chiru[RA4DAN-FF:CHIru]★
rame[CJR-REAL]/
Figure 2: Format of the database of a poem: ★ indicates continuation onto the next line without a break; the first line, which includes |A|, gives the name of the poet; the second line, which includes |B|, gives the contents of the poem and added information.
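The record format in Figure 2 can be read with a short regular expression. The sketch below assumes the ★ continuation markers have already been removed and the lines joined. It treats each token as `surface[TAG]` or `surface[TAG:LEMMA]`, which is an interpretation of the figure and not a documented specification. `parse_record` is a hypothetical helper.

```python
import re

# surface form, then [TAG] or [TAG:LEMMA]; "/" and "," between tokens are skipped
TOKEN = re.compile(r"([A-Za-z]+)\[([A-Z0-9-]+)(?::([^\]]+))?\]")


def parse_record(record: str):
    """Split 'ID|KIND|BODY' and extract (surface, tag, lemma) triples from BODY."""
    poem_id, kind, body = record.split("|", 2)
    tokens = [(surface, tag, lemma or None)
              for surface, tag, lemma in TOKEN.findall(body)]
    return poem_id, kind, tokens


record = ("KW00029800|B|Tatsutahime[NOUN-PLNAME:TATSUTAHIME]/"
          "tamukuru[KASHIMO2-ATTR:TAMUkuru],kami[NOUN:KAMI]"
          "no[SUB]are[RAHEN-REAL]ba[CAUS]koso[KP]/")

poem_id, kind, tokens = parse_record(record)
for surface, tag, lemma in tokens:
    print(surface, tag, lemma)
```

Tokens without a lemma part, such as `no[SUB]`, are returned with `None` in the third position.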
14. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 14
(A) Corpus: Translations (CT)
$A|000298
$B|秋の末近くなって帰り道についた竜田姫が、道中の無事を願って手向け ★
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
$C|秋の歌
$D|秋の末近くなって帰り道についた竜田姫が、道中の無事を願って手向け ★
をする神があるからこそ、秋の木の葉が幣となって散っているのだろう。
$I|あきのすえちかくなってかえりみちについたたつたひめが、どうちゅう ★
のぶじをねがってたむけをするかみがあるからこそ、あきのこのはがぬさ ★
となってちっているのだろう。
Figure 3: Format of the database of a CT
15. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 15
(B) Tokenisation:
original text
立田姫手向ける神の有ればこそ秋の木の葉の幣と散るらめ
↓
tokenising
立田姫/手向ける/神/の/[有れ]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らめ]
↓
converting into predicative form
立田姫/手向ける/神/の/[有り]/ば/こそ/秋/の/木の葉/の/幣/と/散る/[らむ]
Figure 4: Tokenisation of poem texts
16. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 16
(C) meta-code conversion
CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- 姫 -- hime princess
BG-02-3770-04-080-C たむくる 手向くる tamukuru present(verb)
BG-01-5730-02-010-A -- 手 -- te hand
BG-02-1700-01-040-A -- 向ける -- mukeru for
BG-01-2030-01-030-A かみ 神 kami god
BG-08-0061-07-010-A の の no SUB (particle)
BG-02-1200-01-010-C あれ 有り are be
BG-08-0064-26-010-A ば ば ba because (particle)
BG-04-1120-05-150-A -- ば -- ba because (reason)
BG-08-0065-01-010-A こそ こそ koso KP (emphasis)
Figure 5: Meta-code conversion in case of OP
17. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 17
(C) Structure of meta-code-1
BG-01-2030-01-030-A-かみ-神
● ● ● ● ● ● ● ●
(1) (2) (3) (4) (5) (6) (7) (8)
Figure 6: Structure of an item of BG database in the case of kami (god):
(1) database ID (BG = short-unit general vocabulary);
(2) part of speech ID (01 = noun);
(3) group ID (2030 = Shinto deities and Buddhas);
(4) field ID;
(5) exact ID (030 = god);
(6) era-flag (A = contemporary, C = classic);
(7) Chinese character reading;
(8) Chinese character
18. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 18
(C) Structure of the meta-code-2
BG-01-2600-01-020-A
yononaka (world)
(1) = BG-01-2610-01-040-A
yo (world)
(2)
+ BG-08-0010-01-021-A
no (of)
(3)
+ BG-01-1770-01-080-A
naka (inside)
(4)
Figure 7: Structure of an item of the semantic table in the case
of a compound word, yononaka (world)
19. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 19
(C) meta-code conversion-3
CH-29-2130-01-010-A たつたひめ 立田姫 Tatsutahime Princess-Tatsuta
CH-29-0000-14-010-A -- 立田 -- Tatsuta Tatsuta
BG-01-2030-01-101-A -- 姫 -- hime princess
BG-02-3770-04-080-C たむくる 手向くる tamukuru present(verb)
BG-01-5730-02-010-A -- 手 -- te hand
BG-02-1700-01-040-A -- 向ける -- mukeru for
BG-01-2030-01-030-A かみ 神 kami god
BG-08-0061-07-010-A の の no SUB (particle)
BG-02-1200-01-010-C あれ 有り are be
BG-08-0064-26-010-A ば ば ba because (particle)
BG-04-1120-05-150-A -- ば -- ba because (reason)
BG-08-0065-01-010-A こそ こそ koso KP (emphasis)
Figure 8: Meta-code conversion in case of OP
20. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 20
poet → (write) → OP → (read) → expert reader → (write) → CT → (read) → novice reader
OP and CT are compared.
Fields of experience: poet, 10th century; expert reader, 20th century (expert); novice reader, 20th century (novice)
Figure 9: Schema of relationship between OP and CT
21. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 21
+-------- # of pair
| +----- value of matching level, exact=17, field=13, group=10
| | +-- # of POS
| | |
| | | # of element of OP ----+ +- # of element of CT
| | | element of OP -+ | | +--- element of CT
| | | | | | |
1 17 11 立田姫 00 <-> 12 竜田姫 (Tatsutahime)
2 17 47 手 04 <-> 25 手 (hand)
3 17 47 向ける 05 <-> 26 向ける (toward)
4 17 2 神 06 <-> 32 神 (god)
5 10 61 の 07 <-> 33 が (SUB)
6 17 47 有り 08 <-> 34 ある (be)
7 10 64 ば 09 <-> 35 から (because)
8 17 65 こそ 11 <-> 36 こそ (EM)
9 17 2 秋 12 <-> 38 秋 (autumn)
10 17 71 の 13 <-> 39 の (CON)
11 17 2 木の葉 14 <-> 40 木の葉 (leaf of tree)
12 17 2 幣 19 <-> 45 幣 (present)
13 17 61 と 20 <-> 46 と (CRD)
14 17 47 散る 21 <-> 49 散る (fall)
15 13 74 らむ 22 <-> 54 う (CJR)
Figure 10: Example of the matching process
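The three matching values (exact = 17, field = 13, group = 10) equal the character lengths of the code prefixes `BG-01-2030-01-030`, `BG-01-2030-01` and `BG-01-2030`. The sketch below assumes, as an interpretation not stated in the slides, that a matching level is the longest of those prefixes shared by two codes. `match_level` is a hypothetical name.

```python
# Prefix lengths: through the exact ID (17), the field ID (13), the group ID (10).
LEVELS = (("exact", 17), ("field", 13), ("group", 10))


def match_level(code_a: str, code_b: str):
    """Return (level name, value) for the longest shared prefix, or (None, 0)."""
    for name, length in LEVELS:
        if len(code_a) >= length and code_a[:length] == code_b[:length]:
            return name, length
    return None, 0


kami = "BG-01-2030-01-030-A"   # kami (god), from Figure 5
hime = "BG-01-2030-01-101-A"   # hime (princess), from Figure 5

print(match_level(kami, kami))  # same code
print(match_level(kami, hime))  # same group and field, different exact ID
```

Under this reading the era-flag is ignored, so a classic (C) item and a contemporary (A) item with the same exact ID still match at the exact level.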
22. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 22
Residual
CT (秋の末近くなって帰り道についた)竜田姫(が道中の無事を願って)手向け
OP ! !! ! ! ! ! ! ! ! 立田姫 ! ! ! ! ! ! ! 手向ける
CT (をする)神があるからこそ秋の木の葉(が)幣(となって)散っ(ているのだろ) う
OP ! ! 神のあれ ば こそ秋の木の葉[の]幣 と ! ! 散る ! ! ! ! らめ
Figure 11: Example of the matching process in the case of kks 298 in Komachiya (1982)
23. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 23
Components of OP
Table 2: Result of subtracting the elements of OP(298) from those of CT(298, koma); it indicates the ratio of the components of OP(298).
OP (valid number of elements) = 16
E (ratio of exact match)     12/16 = 0.750
F (ratio of field match)      1/16 = 0.062
G (ratio of group match)      2/16 = 0.125
T (ratio of total match)     15/16 = 0.938
U (ratio of unmatched OP)    1 - T = 0.062
24. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 24
Calculation of Residual Rate
\begin{align}
D &= 1 - \frac{P}{T} \tag{1} \\
  &= 1 - \frac{16}{41} \tag{2} \\
  &= 0.61 \tag{3}
\end{align}
25. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 25
Components of CT
Table 3: Components of CT in the case of kks 298 by Komachiya (1982): fabs(D-H) stands for the absolute value of the practical value, D, minus the theoretical value, H.
CT (valid number of elements) = 41
Item                               Calculation                     Formula
W (ratio of original word use)     12/41 = 0.293                   E/CT
A (ratio of annotation)            1 - 0.293 = 0.707               1 - W
---breakdown of the annotation---
P1 (ratio of FG paraphrased)       (1+2)/41 = 0.073                (F+G)/CT
P2 (ratio of U paraphrased)        (0.707-0.073)*0.062 = 0.040     (A-P1)*U
D (ratio of purely added)          0.707-(0.073+0.040) = 0.595     A-(P1+P2)
H (theoretical value of D)         1 - 16/41 = 0.610               1 - OP/CT
Gap                                fabs(0.595-0.610) = 0.015       fabs(D-H)
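The arithmetic behind Tables 2 and 3 can be checked with a short sketch. The function name `components` and its argument names are hypothetical. P1 is computed here as the number of field and group matches divided by the number of CT elements, an assumption that reproduces the 0.073 (3 of 41) reported for kks 298. The other quantities follow the formulas printed in the tables.

```python
def components(op_total: int, ct_total: int, exact: int, field: int, group: int) -> dict:
    """Recompute the ratios of Tables 2 and 3 from the raw match counts."""
    E = exact / op_total           # ratio of exact match
    F = field / op_total           # ratio of field match
    G = group / op_total           # ratio of group match
    T = E + F + G                  # ratio of total match
    U = 1 - T                      # ratio of unmatched OP
    W = exact / ct_total           # ratio of original word use
    A = 1 - W                      # ratio of annotation
    P1 = (field + group) / ct_total    # assumed: FG matches over CT elements
    P2 = (A - P1) * U              # ratio of U paraphrased
    D = A - (P1 + P2)              # ratio of purely added
    H = 1 - op_total / ct_total    # theoretical value of D
    return {"E": E, "F": F, "G": G, "T": T, "U": U, "W": W, "A": A,
            "P1": P1, "P2": P2, "D": D, "H": H, "Gap": abs(D - H)}


kks298 = components(op_total=16, ct_total=41, exact=12, field=1, group=2)
for name, value in kks298.items():
    print(f"{name:>3} = {value:.3f}")
```

Working from unrounded values avoids the small drift that appears when the rounded figures are chained by hand.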
26. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 26
Subtraction: CT - OP
OP(298): 16 elements
Exact 12 (75.0%)
Field 1 (6.2%)
Group 2 (12.5%)
Unmatched 1 (6.2%)
CT(298, koma): 41 elements
W 12 (29.3%)
P1 3 (7.3%)
P2 1 (4.0%)
D 25 (59.5%)
Figure 12: Pie charts illustrating the components of OP(298) and CT(298, koma)
27. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 27
(E) Mathematical modelling
\begin{align}
cw(t_1, t_2) &= \bigl(1 + \log ctf(t_1, t_2)\bigr)\,\sqrt{idf(t_1)\, idf(t_2)} \tag{4} \\
idf(t) &= \log \frac{N}{df(t)} \tag{5}
\end{align}
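A minimal sketch of the weighting above on a toy corpus. It assumes that N is the number of poems, that df(t) is the number of poems containing t, that ctf(t1, t2) counts the poems in which both tokens occur, that the two idf values are combined under a square root, and that logarithms are natural. None of these choices is confirmed by the slide. The token lists are invented for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_weights(poems):
    """Return {(t1, t2): cw} for every token pair that co-occurs in some poem."""
    n = len(poems)
    df = Counter()    # number of poems containing each token
    ctf = Counter()   # number of poems containing both tokens of a pair
    for tokens in poems:
        unique = sorted(set(tokens))
        df.update(unique)
        ctf.update(combinations(unique, 2))
    idf = {t: math.log(n / df[t]) for t in df}
    return {pair: (1 + math.log(count)) * math.sqrt(idf[pair[0]] * idf[pair[1]])
            for pair, count in ctf.items()}


# Invented toy data, not taken from the Hachidaishu.
poems = [
    ["warbler", "spring", "flower"],
    ["warbler", "plum"],
    ["cuckoo", "summer", "mountain"],
    ["cuckoo", "mountain"],
]
weights = cooccurrence_weights(poems)
print(weights[("cuckoo", "mountain")])
```

Pairs are stored in sorted order, so `("cuckoo", "mountain")` and `("mountain", "cuckoo")` refer to the same entry. A token that appears in every poem has idf 0 and therefore contributes weight 0 to all of its pairs.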
28. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 28
[Figure: two network diagrams, warbler (warbler-CT-23-229-3.73-15) and cuckoo (cuckoo-CT-40-370-3.27-16). Node labels include spring, flower, plum, green willow, spring haze, branch, Tatsuta (place name), summer, mountain, Otowa (place name), voice, May, midsummer rain and iris, each with a numeric label.]
29. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 29
Conclusion
The thesaurus annotated with meta-codes allows researchers
1. to identify different orthographies as the same word;
2. to attach an alternative semantic ID to a word that has the same form but more than one meaning (polysemous word);
3. to attach meta-codes not only to tokens recognised as a single, simple word but also to longer tokens;
4. to indicate similarity between tokens;
5. to detect common or different tokens across two or more texts, which reveals the similarities and differences between those texts;
6. to indicate the relative differences between two words in literary works.
30. TokyoTech and U. of Chicago Seminar 2013 Tokyo, Japan 30
Questions
• Computer Modelling of Classical Japanese Poetic Vocabulary
http://warbler.ryu.titech.ac.jp/waka/poem.cgi
• Inquiry:
Hilofumi Yamamoto
yamagen@ryu.titech.ac.jp
• Thank you.