Slides from my talk at LaTeCH 2015 (The 9th Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities, held in conjunction with ACL-IJCNLP 2015) in Beijing, 30 July 2015. The full paper can be found in the ACL Anthology at http://aclweb.org/anthology/W/W15/W15-3704.pdf
LaTeCH 2015: Measuring the Structural and Conceptual Similarity of Folktales using Plot Graphs (Lestari & Manurung)
1. Beijing
30 July ‘15
Folktales Plot graphs Similarity Experiments
Measuring the Structural and
Conceptual Similarity
of Folktales using Plot Graphs
Victoria Anugrah Lestari & Ruli Manurung
Faculty of Computer Science
Universitas Indonesia
victoria.anugrah@ui.ac.id, maruli@cs.ui.ac.id
Beijing, China
30 July 2015
2. Folktales
A folktale is a characteristically anonymous, timeless, and placeless tale circulated orally among a people.
http://onceuponatime.wikia.com/wiki/Rumpelstiltskin_(Fairytale)
http://indonesianfolklore.blogspot.com/2007/10/lutung-kasarung-folklore-from-west-java.html
http://indonesianfolklore.blogspot.com/2007/10/keong-emas-golden-snail-prince-raden.html
3. Humanities work on folktales
• Vladimir Propp (1928): Morphology of the (Russian) folktale → story grammars
• Aarne-Thompson-Uther (ATU) index (1910, 1961, 2004): story motifs, hierarchy of folktale types
4. Computational work on folktales
• Vaz Lobo & de Matos (2010): latent semantic mapping + clustering of 453 fairy tales from Gutenberg.
• Nguyen et al. (2012): classification by genre (e.g. legend, fairy tale, joke, puzzle, urban legend) using lexical, POS, NE, and metadata features.
• Nguyen et al. (2013): ranking by story type (ATU, Brunvand) using IR, lexical features, and SVO triplets.
• Karsdorp & van den Bosch (2013): topic modelling (L-LDA) for multi-label assignment of ATU motifs (defined by types).
6. Folktales as narratives
• Narratives: Focus on sequence of related events → structure
• Models of narrative: Turner (1994), Mateas & Stern (2003), Pérez y Pérez & Sharples (2004), etc.
• However: Fisseni & Löwe (2012): People tend to focus on motifs & content, less on structure.
8. Goals of this work
• Construct representations that capture structural & conceptual properties.
• Define a similarity metric and use it to organize folktales.
• Compare to BoW-based methods with respect to ATU.
18. Example
[Figure: plot graph built up step by step. Action nodes in sequence: live, sleep, come, play. Children with edge labels: live: lion (subj), forest (in); sleep: it (subj), tree (under); come: mouse (subj); play: it (subj), lion (on).]
19. Automatic construction
Stanford CoreNLP SemanticGraph (a.k.a. dependency parse)
20. From SemanticGraph to plot graph
Some observation-based heuristics for selecting relations:
• Select as action nodes the governors of nsubj (nominal subject), expl (expletive “there”), and aux (auxiliary) relations.
• Add a child to an action node if relation(parent, child) is not one of conj, comp, adv, aux, cop, dep, expl, mark.
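The two heuristics can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `(relation, governor, dependent)` triple format, the function names, and the choice to match excluded relation names as substrings (so that `comp` also covers `ccomp`/`xcomp` and `adv` covers `advmod`/`advcl`) are all assumptions made for the example.

```python
# Relations whose governors are kept as action nodes (first heuristic).
ACTION_RELATIONS = {"nsubj", "expl", "aux"}
# Relation names that block a child from being attached (second heuristic).
# Assumption: matched as substrings, so "comp" also matches "ccomp"/"xcomp".
EXCLUDED = ("conj", "comp", "adv", "aux", "cop", "dep", "expl", "mark")


def is_excluded(relation):
    """Return True if the relation name contains any excluded tag."""
    return any(tag in relation for tag in EXCLUDED)


def build_plot_graph(triples):
    """Map each action node to its (relation, child) pairs.

    `triples` is a hypothetical list of (relation, governor, dependent)
    tuples. Nodes are keyed by word here; a real pipeline would need token
    indices so that repeated verbs stay separate.
    """
    graph = {}
    # Pass 1: collect action nodes in order of first appearance.
    for relation, governor, _ in triples:
        if relation in ACTION_RELATIONS:
            graph.setdefault(governor, [])
    # Pass 2: attach children whose relation is not excluded.
    for relation, governor, dependent in triples:
        if governor in graph and not is_excluded(relation):
            graph[governor].append((relation, dependent))
    return graph


# Hand-written triples for "A lion lives in the forest. Then it sleeps under a tree."
triples = [
    ("nsubj", "live", "lion"),
    ("det", "lion", "a"),
    ("prep_in", "live", "forest"),
    ("advmod", "sleep", "then"),
    ("nsubj", "sleep", "it"),
    ("prep_under", "sleep", "tree"),
]
plot_graph = build_plot_graph(triples)
```

Here `advmod` is dropped by the exclusion list, and `det` is ignored because its governor (`lion`) is not an action node.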
26. Measuring plot graph similarity
Text 1: A lion lives in the forest. One day it sleeps under a tree. Then a mouse plays on the lion and disturbs its sleep.
Text 2: A lion eats meat. A lion lives in the jungle. One day it rests under a tree.
28. Alignment of event sequence
Needleman-Wunsch
29. Conceptual similarity: Wu-Palmer
Measure path distance between 2 words based on WordNet taxonomy
Word pairs      Similarity
sleep, live     0.25
disturb, rest   0.33
live, eat       0.29
prince, king    0.94
jungle, forest  0.31
palace, house   0.91
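Wu-Palmer similarity scores two concepts by the depth of their deepest shared ancestor relative to their own depths: 2 × depth(LCS) / (depth(a) + depth(b)). The sketch below shows the computation on a toy, hand-made taxonomy. The taxonomy, the names, and the convention that the root has depth 1 are assumptions for illustration; this is not WordNet, so the values do not correspond to the word-pair scores on the slide.

```python
# Toy taxonomy (hypothetical): each concept points to its parent; the root has None.
PARENT = {
    "entity": None,
    "structure": "entity",
    "building": "structure",
    "house": "building",
    "palace": "building",
    "region": "entity",
    "forest": "region",
}


def path_to_root(node):
    """Return the chain [node, parent, ..., root]."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth of a node, counting the root as depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)), LCS = deepest common ancestor."""
    ancestors_b = set(path_to_root(b))
    # Walking up from a, the first node also above b is the deepest shared one.
    lcs = next(node for node in path_to_root(a) if node in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


close = wu_palmer("house", "palace")  # share "building" (depth 3)
far = wu_palmer("house", "forest")    # share only the root (depth 1)
```

Concepts that meet deep in the taxonomy score close to 1, while concepts that only share the root score low.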
30. Example mapping
Wu-Palmer similarity
         eat   live  rest
live     0.29  1     0.33
sleep    0.22  0.25  0.43
play     0.29  0.33  0.43
disturb  0.29  0.33  0.33
Alignment scoring & traceback matrix
                eat    live   rest
         0      -1     -2     -3
live     -1     0.29   0      -1
sleep    -2     -0.71  0.54   1
play     -3     -1.71  -0.38  0.96
disturb  -4     -2.71  -1.38  -0.04
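A scoring matrix of this kind can be filled with the standard Needleman-Wunsch recurrence, using the Wu-Palmer similarity of two action words as the match score. The sketch below is an illustration under assumptions: the function and variable names are hypothetical, and the gap penalty of -1 is inferred from the first row and column of the matrix. It takes the Wu-Palmer values as printed on the slide and reproduces the `live` row of the scoring matrix (-1, 0.29, 0, -1).

```python
def needleman_wunsch(seq1, seq2, sim, gap=-1.0):
    """Return (score matrix, alignment) for two event sequences.

    sim(a, b) is the match score; `gap` is added for every skipped event.
    In the alignment, None marks a gap.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sim(seq1[i - 1], seq2[j - 1]),  # align both
                score[i - 1][j] + gap,  # seq1 event aligned to a gap
                score[i][j - 1] + gap,  # seq2 event aligned to a gap
            )
    # Traceback from the bottom-right cell to the top-left cell.
    aligned, i, j = [], len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + sim(seq1[i - 1], seq2[j - 1])
        ):
            aligned.append((seq1[i - 1], seq2[j - 1]))
            i, j = i - 1, j - 1
        elif j == 0 or (i > 0 and score[i][j] == score[i - 1][j] + gap):
            aligned.append((seq1[i - 1], None))
            i -= 1
        else:
            aligned.append((None, seq2[j - 1]))
            j -= 1
    return score, aligned[::-1]


# Wu-Palmer values copied from the slide's similarity table.
WUP = {
    ("live", "eat"): 0.29, ("live", "live"): 1.0, ("live", "rest"): 0.33,
    ("sleep", "eat"): 0.22, ("sleep", "live"): 0.25, ("sleep", "rest"): 0.43,
    ("play", "eat"): 0.29, ("play", "live"): 0.33, ("play", "rest"): 0.43,
    ("disturb", "eat"): 0.29, ("disturb", "live"): 0.33, ("disturb", "rest"): 0.33,
}
story1 = ["live", "sleep", "play", "disturb"]
story2 = ["eat", "live", "rest"]
matrix, alignment = needleman_wunsch(story1, story2, lambda a, b: WUP[(a, b)])
```

The bottom-right cell of `matrix` holds the overall alignment score, and `alignment` lists the paired events with `None` wherever one sequence has a gap.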
31. Folktale similarity measurement
p1 & p2 = the two plot graphs being compared
α = weighting for action node similarity
β = weighting for child node similarity
(a1i, a2i) = pair of action nodes from alignment of p1 and p2
g = gap penalty
(c1i, c2i) = pair of child nodes from alignment of p1 and p2
n = alignment length of p1 and p2
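The formula itself does not appear in this text, so the following is only one plausible reading of the symbols listed above, not the paper's definition: each aligned pair contributes α times its action-node similarity plus β times its child-node similarity, each gap contributes g, and the total is divided by the alignment length n. All names in the sketch are hypothetical.

```python
def plot_graph_similarity(aligned, alpha=0.5, beta=0.5, gap=0.0):
    """Average score over an alignment (assumed form, see lead-in).

    `aligned` has one entry per alignment position: None for a gap, or a
    tuple (action_similarity, child_similarity) for a matched pair.
    """
    if not aligned:
        return 0.0
    total = 0.0
    for entry in aligned:
        if entry is None:
            total += gap  # position where one story has no counterpart
        else:
            action_sim, child_sim = entry
            total += alpha * action_sim + beta * child_sim
    return total / len(aligned)


# Three alignment positions: a strong match, a gap, and a weaker match.
example = [(1.0, 0.8), None, (0.43, 0.5)]
score = plot_graph_similarity(example, alpha=0.7, beta=0.3, gap=-0.5)
```

With α + β = 1 and similarities in [0, 1], a matched position contributes at most 1, so a negative g pulls the average down for every unmatched event.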
32. Initial experiment
• Determining values for α, β, and g
• For each story, 5 paraphrases manually created: word replacement, sentence structure change, insertion/deletion of phrases & sentences
• Measure similarity between paraphrases & across stories. Maximize the difference.
No.  Title                                #Words
1    A friend in need is a friend indeed  133
2    Honesty is the best policy           129
3    A town mouse and a country mouse     260
4    How to tell a true princess          382
5    The butterfly lovers                 572
6    Rumpelstiltskin                      1106
http://www.english-for-students.com/Simple-Short-Stories.html
33. Beijing
30 July ‘15
Folktales Plot graphs Similarity Experiments
Similarity scores using various parameters
                          α = 0.7, β = 0.3      α = 0.5, β = 0.5      α = 0.3, β = 0.7
g =                       -1    -0.5  0         -1    -0.5  0         -1    -0.5  0
Between paraphrases, Avg  0.83  0.80  0.74      0.83  0.80  0.73      0.83  0.79  0.71
Between paraphrases, Min  0.69  0.61  0.53      0.69  0.60  0.49      0.68  0.58  0.45
Across stories, Avg       0.37  0.30  0.15      0.41  0.32  0.12      0.45  0.33  0.09
Across stories, Max       0.55  0.45  0.25      0.55  0.43  0.20      0.55  0.42  0.16
BP min - AS max           0.14  0.16  0.28      0.14  0.17  0.29      0.13  0.16  0.29
Diff. between avgs        0.46  0.50  0.59      0.42  0.48  0.61      0.38  0.46  0.62
(BP = between paraphrases, AS = across stories)
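The two separation rows ("BP min - AS max" and "Diff. between avgs") follow directly from the four rows above them. A minimal sketch that recomputes them and looks up the setting with the largest difference between averages; the data layout and names are assumptions for illustration, and the sketch does not claim to be how the final parameters were chosen.

```python
# (alpha, beta, g) -> (paraphrase avg, paraphrase min, across avg, across max),
# copied from the slide's table.
SCORES = {
    (0.7, 0.3, -1.0): (0.83, 0.69, 0.37, 0.55),
    (0.7, 0.3, -0.5): (0.80, 0.61, 0.30, 0.45),
    (0.7, 0.3, 0.0): (0.74, 0.53, 0.15, 0.25),
    (0.5, 0.5, -1.0): (0.83, 0.69, 0.41, 0.55),
    (0.5, 0.5, -0.5): (0.80, 0.60, 0.32, 0.43),
    (0.5, 0.5, 0.0): (0.73, 0.49, 0.12, 0.20),
    (0.3, 0.7, -1.0): (0.83, 0.68, 0.45, 0.55),
    (0.3, 0.7, -0.5): (0.79, 0.58, 0.33, 0.42),
    (0.3, 0.7, 0.0): (0.71, 0.45, 0.09, 0.16),
}


def separation(scores):
    """Return (worst-case margin, difference between averages), rounded to 2 places."""
    bp_avg, bp_min, as_avg, as_max = scores
    return round(bp_min - as_max, 2), round(bp_avg - as_avg, 2)


margins = {params: separation(values) for params, values in SCORES.items()}
# Setting with the largest difference between the two averages.
best = max(margins, key=lambda params: margins[params][1])
```

In the table, g = 0 gives the largest separation under each of the three weightings.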
34. Main experiment: BoW comparison
24 fairy tales from the Fairy Books of Andrew Lang, grouped into 5 clusters under ATU (fairy tales):
• Supernatural Adversaries: Bluebeard; Hansel and Gretel; Jack and the Beanstalk; Rapunzel; The Twelve Dancing Princesses.
• Supernatural or Enchanted Relatives: Beauty and the Beast; Brother and Sister; East of the Sun, West of the Moon; Snow White and Rose Red; The Bushy Bride; The Six Swans; The Sleeping Beauty.
• Supernatural Helpers: Cinderella; Donkey Skin; Puss in Boots; Rumpelstiltskin; The Goose Girl; The Story of Sigurd.
• Magic Objects: Aladdin and the Wonderful Lamp; Fortunatus and His Purse; The Golden Goose; The Magic Ring.
• Other Stories of the Supernatural: Little Thumb; The Princess and the Pea.
Measure similarity within clusters & across clusters.
http://www.gutenberg.org/ebooks/30580
35.
Story                                Plot graph        Bag of words      Combination
                                     Within  Across    Within  Across    Within  Across
Supernatural adversaries
  Bluebeard                          0.1000  0.1037    0.8629  0.8618    0.4814  0.4586
  Hansel and Gretel                  0.1075  0.1157    0.8492  0.8630    0.4783  0.4894
  Jack and the Beanstalk             0.1050  0.1110    0.9050  0.8891    0.5050  0.5001
  Rapunzel                           0.1000  0.1047    0.8790  0.8575    0.4895  0.4571
  The Twelve Dancing Princesses      0.1125  0.1073    0.8808  0.8631    0.4966  0.4610
Supernatural or enchanted relatives
  Beauty and the Beast               0.0767  0.0705    0.8803  0.8605    0.4785  0.4397
  Brother and Sister                 0.1233  0.1135    0.8881  0.8722    0.5057  0.4654
  East of the Sun, West of the Moon  0.1117  0.1012    0.8914  0.8571    0.5015  0.4525
  Snow White and Rose Red            0.1200  0.1165    0.8650  0.8566    0.4925  0.4865
  The Bushy Bride                    0.1200  0.1182    0.8862  0.8739    0.5031  0.4960
  The Six Swans                      0.0925  0.1100    0.9006  0.8662    0.5020  0.4881
  The Sleeping Beauty                0.1125  0.1194    0.8990  0.8918    0.5087  0.5056
Supernatural helpers
  Cinderella                         0.1180  0.1144    0.8150  0.8306    0.4665  0.4725
  Donkey Skin                        0.1040  0.1122    0.8873  0.9025    0.4956  0.5074
  Puss in Boots                      0.1175  0.1095    0.8170  0.8486    0.4672  0.4551
  Rumpelstiltskin                    0.0750  0.0858    0.8467  0.8569    0.4609  0.4478
  The Goose Girl                     0.1240  0.1178    0.8617  0.8624    0.4928  0.4643
  The Story of Sigurd                0.1080  0.1178    0.8516  0.8670    0.4800  0.4664
Magic objects
  Aladdin and the Wonderful Lamp     0.0975  0.0910    0.8958  0.8664    0.4946  0.4559
  Fortunatus and His Purse           0.1133  0.1185    0.8945  0.8306    0.5039  0.4519
  The Golden Goose                   0.1033  0.1155    0.9006  0.8529    0.5012  0.4611
  The Magic Ring                     0.1033  0.1040    0.9120  0.8960    0.5077  0.4762
Other stories
  Little Thumb                       0.0300  0.1214    0.7444  0.8562    0.3872  0.4675
  The Princess and the Pea           0.0300  0.0405    0.7444  0.7844    0.3872  0.3945
# Similarity within > across         10 (41.67%)       15 (62.50%)       19 (79.17%)
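The Within and Across columns compare each story with the other stories of its own cluster and with the stories of all other clusters. A minimal sketch of that comparison, assuming each value is a plain mean of pairwise similarities (an assumption; the toy stories, clusters, and scores below are invented for illustration):

```python
from statistics import mean


def within_across(story, clusters, sim):
    """Return (mean similarity to cluster-mates, mean similarity to all other stories).

    Assumes every cluster has at least two stories, so both means are defined.
    """
    own = next(members for members in clusters.values() if story in members)
    mates = [s for s in own if s != story]
    others = [s for members in clusters.values() if members is not own for s in members]
    return mean(sim(story, s) for s in mates), mean(sim(story, s) for s in others)


def count_within_greater(clusters, sim):
    """Count the stories that are closer to their own cluster than to the rest."""
    count = 0
    for members in clusters.values():
        for story in members:
            within, across = within_across(story, clusters, sim)
            if within > across:
                count += 1
    return count


# Invented toy data: two clusters of two stories each.
clusters = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
PAIRS = {
    frozenset(("a1", "a2")): 0.9, frozenset(("b1", "b2")): 0.2,
    frozenset(("a1", "b1")): 0.3, frozenset(("a1", "b2")): 0.4,
    frozenset(("a2", "b1")): 0.5, frozenset(("a2", "b2")): 0.1,
}


def toy_sim(x, y):
    """Symmetric lookup of the invented pairwise scores."""
    return PAIRS[frozenset((x, y))]


n_better = count_within_greater(clusters, toy_sim)
```

A clustering-friendly similarity measure should make this count a large share of the stories, which is what the last row of the table reports for each method.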
36. Analysis & Discussion
• Errors in automatic construction (dependency parses aren’t really semantic graphs), e.g.: “along came a mouse” vs. “a mouse came”, coreference errors.
• Consistent with Fisseni & Löwe (2012) findings: focus more on content & motifs?
• Combination of plot graph + BoW yields best results.