ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS
April 9-11, 2015, Yekaterinburg
Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
This work is partially supported by RFH grant #13-04-12020,
"New open electronic thesaurus for Russian".
What is Unitex?
 An open-source corpus processor based on automata-oriented
technology
 Mainly developed by Sébastien Paumier at the Institut Gaspard-Monge
(IGM), University of Paris-Est Marne-la-Vallée (France)
 Runs on Windows, Linux, Mac OS and other systems
 Has lexical resources for French, English, Greek, Portuguese, Russian,
Thai, Korean, Italian, Spanish, Norwegian, Arabic, German and more
 http://www-igm.univ-mlv.fr/~unitex/
What is a corpus?
A corpus is a collection of pieces of language text in electronic form, selected
according to external criteria to represent, as far as possible, a language or
language variety as a source of data for linguistic research.
Sinclair 2005
What is a Finite State Transducer (FST)?
An FST is a type of finite automaton which maps between two sets of symbols.
We can visualize an FST as a two-tape automaton that recognizes or
generates pairs of strings. Intuitively, we can do this by labeling each arc in
the finite-state machine with two symbol strings, one from each tape.
Jurafsky 2000
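The two-tape view can be made concrete with a toy transducer. The sketch below is only an illustration and is not Unitex code: the class name ToyFST, its arc table and the example mapping are assumptions made up for this example. Each arc reads one input symbol and writes one output string.

```python
from typing import Dict, Optional, Set, Tuple

# (state, input symbol) -> (next state, output string)
Arcs = Dict[Tuple[int, str], Tuple[int, str]]


class ToyFST:
    """A deterministic finite-state transducer (hypothetical helper)."""

    def __init__(self, arcs: Arcs, start: int, finals: Set[int]) -> None:
        self.arcs = arcs
        self.start = start
        self.finals = finals

    def transduce(self, symbols: str) -> Optional[str]:
        """Return the output string, or None if the input is rejected."""
        state = self.start
        output = []
        for symbol in symbols:
            key = (state, symbol)
            if key not in self.arcs:
                return None
            state, emitted = self.arcs[key]
            output.append(emitted)
        return "".join(output) if state in self.finals else None


# Accepts strings over {a, b} that end in "b" and writes them in upper case.
fst = ToyFST(
    arcs={
        (0, "a"): (0, "A"),
        (0, "b"): (1, "B"),
        (1, "a"): (0, "A"),
        (1, "b"): (1, "B"),
    },
    start=0,
    finals={1},
)

print(fst.transduce("aab"))  # AAB
print(fst.transduce("aba"))  # None
```

The first call ends in the final state 1 and returns the output tape; the second ends in state 0 and is rejected.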
Simple sentence splitting FST
... в четвертичном периоде. Достигали высоты ...
... в четвертичном периоде. {S} Достигали высоты ...
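The effect of such a splitter can be approximated outside Unitex with a regular expression. This is a rough sketch under a stated assumption: a boundary is sentence-final punctuation followed by whitespace and an upper-case Latin or Cyrillic letter. The function name mark_sentences is hypothetical, and the real Sentence.grf is more elaborate.

```python
import re

# Assumed rule: ., ! or ? followed by whitespace and an upper-case letter.
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-ZА-ЯЁ])")


def mark_sentences(text: str) -> str:
    """Insert the {S} delimiter after each detected sentence boundary."""
    return BOUNDARY.sub(r"\1 {S} ", text)


print(mark_sentences("It lived long ago. It was large."))
# It lived long ago. {S} It was large.
```

The matched punctuation is kept and the delimiter is added next to it, which corresponds to applying a transducer in MERGE mode.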
Get your corpus from a text file in Unitex
1. Run Unitex
 If you are working on Windows, the program will ask you to choose a
personal working directory, which you can change later in
Info > Preferences... > Directories.
2. Select Russian as your working language
 The first time you use a language, the program copies the root
directory of that language to your personal directory, except the
dictionaries.
Get your corpus from a text file in Unitex
3. Open corpus-ru-dbpedia-short-dea-1000.csv from the
Corpus subfolder: Text > Open...
4. Preprocess the text
 Apply Sentence.grf in MERGE mode
 Apply Replace.grf in REPLACE mode
 Tokenize the text
 Apply all default dictionaries
 Analyze unknown words as free compound words
Preprocessing
 Sentence.grf splits the text into sentences by adding an {S} tag before
each following sentence (language dependent)
 Replace.grf removes soft hyphens (U+00AD) and converts no-break spaces
to spaces
 The standard separators (the space, tab and newline characters)
are normalized
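The character-level part of preprocessing can be sketched in a few lines. The function names are hypothetical, and the separator policy is an assumption: a run of separators containing a newline becomes one newline, and any other run becomes one space.

```python
import re

SOFT_HYPHEN = "\u00ad"
NO_BREAK_SPACE = "\u00a0"


def replace_step(text: str) -> str:
    """Drop soft hyphens and turn no-break spaces into plain spaces."""
    return text.replace(SOFT_HYPHEN, "").replace(NO_BREAK_SPACE, " ")


def normalize_separators(text: str) -> str:
    """Collapse each run of spaces, tabs and newlines (assumed policy)."""

    def collapse(match: "re.Match[str]") -> str:
        # Keep one newline if the run contains one, otherwise one space.
        return "\n" if "\n" in match.group(0) else " "

    return re.sub(r"[ \t\n]+", collapse, text)


raw = "co\u00adoperate\u00a0now \t\n next"
print(repr(normalize_separators(replace_step(raw))))
# 'cooperate now\nnext'
```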
Tokenization
 is language (alphabet) dependent
 Newlines in a text are replaced by spaces
 A token can be:
 the sentence delimiter {S}
 the stop marker {STOP} to delimit texts
 a lexical tag, e.g. {豫丕丕,.N+ORG+gen(M)}
 a contiguous sequence of letters (from alphabet.txt )
 one (and only one) non-letter character, e.g. a digit
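The token types listed above can be approximated with one regular expression. This is a sketch, not the Unitex tokenizer: Python's Unicode letter class stands in for alphabet.txt, and the function name tokenize is hypothetical.

```python
import re
from typing import List

# Alternatives are tried left to right, so brace tokens win over letters.
TOKEN = re.compile(
    r"\{S\}"                        # sentence delimiter
    r"|\{STOP\}"                    # stop marker
    r"|\{[^{},]+,[^{}]*\.[^{}]+\}"  # lexical tag such as {form,lemma.CODE}
    r"|[^\W\d_]+"                   # contiguous run of letters
    r"|."                           # any other single character
)


def tokenize(text: str) -> List[str]:
    """Split text into tokens after replacing newlines with spaces."""
    return TOKEN.findall(text.replace("\n", " "))


print(tokenize("{S}Room 12"))
# ['{S}', 'Room', ' ', '1', '2']
```

Each digit becomes its own token, which is why a dedicated mask is needed later to match a whole number.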
Applying dictionaries
 consists of building the subset of the dictionaries that contains only the
forms present in the text
 The corpus becomes "tagged", i.e. every token is assigned all of its possible
lexical tags
 e.g. семью is assigned these lexical tags:
семью,семья.N+anim(j)+gen(F):aeF
семью,.ADV
семью,семь.NUM+plur:t
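The lexical tags follow the pattern form,lemma.CODES:inflections, where an empty lemma means that the lemma equals the inflected form. The sketch below parses such lines and keeps only the entries whose form occurs in the text. The English entries are hypothetical sample data, the names Entry, parse_delaf and text_dictionary are made up for the example, and escaped commas or dots are not handled.

```python
from typing import Dict, Iterable, List, NamedTuple


class Entry(NamedTuple):
    form: str
    lemma: str
    codes: List[str]        # grammatical and semantic codes, e.g. ["N"]
    inflections: List[str]  # inflectional codes, e.g. ["p"]


def parse_delaf(line: str) -> Entry:
    """Parse one 'form,lemma.CODE+CODE:infl:infl' line."""
    form, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    # An empty lemma means the lemma is identical to the inflected form.
    return Entry(form, lemma or form, codes.split("+"), inflections)


def text_dictionary(tokens: Iterable[str],
                    delaf_lines: Iterable[str]) -> Dict[str, List[Entry]]:
    """Keep only the entries whose inflected form occurs in the text."""
    present = set(tokens)
    subset: Dict[str, List[Entry]] = {}
    for line in delaf_lines:
        entry = parse_delaf(line)
        if entry.form in present:
            subset.setdefault(entry.form, []).append(entry)
    return subset


# Hypothetical sample entries, not taken from any Unitex dictionary.
entries = ["leaves,leaf.N:p", "leaves,leave.V:P3s", "left,.A", "tree,.N:s"]
found = text_dictionary(["leaves", "fall"], entries)
print([e.lemma for e in found["leaves"]])  # ['leaf', 'leave']
```

An ambiguous form keeps all of its readings, which is the source of the ambiguity handled in the mining steps.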
Hyponyms and hypernyms
Unlike synonymy and antonymy, which are lexical relations between word
forms, hyponymy/hypernymy is a semantic relation between word meanings:
e.g., {maple} is a hyponym of {tree}, and {tree} is a hyponym of {plant}.
Much attention has been devoted to hyponymy/hypernymy (variously called
subordination/superordination, subset/superset, or the ISA relation)...
Hyponyms and hypernyms
A concept represented by the synset {x, x′, ...} is said to be a hyponym of the
concept represented by the synset {y, y′, ...} if native speakers of English accept
sentences constructed from such frames as "An x is a (kind of) y". The relation
can be represented by including in {x, x′, ...} a pointer to its superordinate, and
including in {y, y′, ...} pointers to its hyponyms.
Miller et al. 1990
Hyponym and hypernym mining from
Russian texts
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S} Достигали
высоты 5,5 метров и массы тела 10—12 тонн.{S}
Таким образом, мамонты были в два раза тяжелее самых
крупных современных наземных млекопитающих —
африканских слонов.
Indicators
Мамонты — вымерший род млекопитающих из семейства
слоновых, живший в четвертичном периоде.{S}
1. Text > Locate pattern...
2. Type род into Regular expression
3. Select Index all utterances in text in Search limitation
4. Click Search
Concordance
 hyponyms and hypernyms are nouns
 вымерший (participle) and широколиственных (adjective) can be
omitted
Patterns in Unitex
1. Text > Locate pattern...
2. Regular expression <N> — <V:S>* род (<A>+<!DIC>)* <N>
3. Click Search
2 matches
Мамонты — вымерший род млекопитающих из семейства слоновых
Бук — род широколиственных деревьев семейства Буковые
Lexical masks
 <род> : matches all the entries that have род as canonical form
 <舒.V> : matches all entries having 舒 as canonical form and
the grammatical code V
 <V> : matches all entries having the grammatical code V
 {舒仆,舒.V} or <舒仆,舒.V> : matches all the entries
having 舒仆 as inflected form, 舒 as canonical form and the
grammatical code V
Lexical masks. Special symbols
 <E> : the empty word or epsilon; matches the empty string
 <TOKEN> : matches any token, except the space; used by default for
morphological filters
 <MOT> : matches any token that consists of letters
 <MIN> : matches any lower-case token
 <MAJ> : matches any upper-case token
 <PRE> : matches any token that starts with a capital letter
Lexical masks. Special symbols
 <DIC> : matches any word that is present in the dictionaries of the text
 <SDIC> : matches any simple word in the text dictionaries
 <CDIC> : matches any compound word in the dictionaries of the text
 <TDIC> : matches any tagged token like {XXX,XXX.XXX}
 <NB> : matches any contiguous sequence of digits (1234 is matched but
not 1 234)
 <#> : prohibits the presence of a space
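The case-based and numeric symbols can be read as simple predicates on tokens. The sketch below is an approximation: Python's Unicode letter tests stand in for the language alphabet file, the function names are hypothetical, and is_nb is applied to a string of adjacent characters because the tokenizer emits digits one at a time.

```python
def is_mot(token: str) -> bool:
    """<MOT>: the token consists of letters only."""
    return token.isalpha()


def is_min(token: str) -> bool:
    """<MIN>: a lower-case token made of letters."""
    return token.isalpha() and token.islower()


def is_maj(token: str) -> bool:
    """<MAJ>: an upper-case token made of letters."""
    return token.isalpha() and token.isupper()


def is_pre(token: str) -> bool:
    """<PRE>: a token made of letters that starts with a capital."""
    return token.isalpha() and token[0].isupper()


def is_nb(text: str) -> bool:
    """<NB>: a contiguous sequence of digits, with no space inside."""
    return text != "" and all(ch in "0123456789" for ch in text)


print(is_maj("NATO"), is_pre("Nato"), is_min("tree"))  # True True True
print(is_nb("1234"), is_nb("1 234"))                   # True False
```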
Graphs in Unitex
 can match text (Finite State Automata)
 can produce new output text (Finite State Transducers)
 in MERGE mode, combine the matched input text with the output text
(useful for tagging)
 in REPLACE mode, convert the matched input text into the output
text
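The difference between the two modes can be sketched with a regular expression standing in for the graph. This is a simplification: a real graph can emit outputs at any box, whereas this hypothetical helper apply_outputs only emits one output before and one after the whole match.

```python
import re


def apply_outputs(text: str, pattern: str, before: str, after: str,
                  mode: str) -> str:
    """Rewrite each match of `pattern` using the outputs `before`/`after`."""
    if mode not in ("MERGE", "REPLACE"):
        raise ValueError(f"unknown mode: {mode}")

    def rewrite(match: "re.Match[str]") -> str:
        if mode == "MERGE":
            # Keep the matched input and place the outputs around it.
            return before + match.group(0) + after
        # REPLACE: the matched input is dropped, only the outputs remain.
        return before + after

    return re.sub(pattern, rewrite, text)


sample = "height 5 m"
print(apply_outputs(sample, r"\d+", "{[", "]=NUM}", "MERGE"))
# height {[5]=NUM} m
print(apply_outputs(sample, r"\d+", "<NUM>", "", "REPLACE"))
# height <NUM> m
```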
1. FSGraph > New
2. Click on the initial state (arrow), then Ctrl-click on an empty area to
create a new box connected to the initial state, type <N>, and
press Enter
A graph for matching text
3. Create a — box, connected to the <N> box
4. Create a род box, connected to the — box
5. Create a <N> box, connected to the род box
6. Click on the second <N> box, then click on the final state (a circle with a
square inside) to connect these 2 boxes
7. Create a <V:S> box between the — and род boxes
8. Create a <A>+<!DIC> box between the род and <N> boxes
9. Save the graph as Graphs/match-hyponyms.grf : FSGraph > Save
A graph for matching text
Text > Locate pattern... , Locate pattern in the form of: Graph, Set
match-hyponyms.grf , Search
Transducers in Unitex
1. Click on the first <N> box (hyponym) and change it to <N>/{[ to add
{[ before the matched noun when the graph is applied in MERGE
mode
2. Click on the <N>/{[ box and then on the — box to disconnect these boxes
3. Create a <E>/]=HYPONYM} box between the <N>/{[ and — boxes.
It will add ]=HYPONYM} after the matched noun
4. Modify the second <N> box in the same way to add a HYPERNYM tag to it
Transducers in Unitex
5. Save the graph as tag-hyponyms.grf
Tagging hyponyms and hypernyms
1. Text > Locate pattern...
2. Set tag-hyponyms.grf
3. Select Merge with input text in Grammar outputs
4. Click Search
5. Build concordance
 The matched and tagged text is stored in the concord.ind
file in the corpus folder
corpus-ru-dbpedia-short-dea-1000_snt
Tagging hyponyms and hypernyms
{[Мамонты]=HYPONYM} — вымерший род
{[млекопитающих]=HYPERNYM} из семейства слоновых
{[Бук]=HYPONYM} — род широколиственных
{[деревьев]=HYPERNYM} семейства Буковые
 We can then use some script to extract tagged hyponyms and
hypernyms...
 or mine them right in Unitex in the REPLACE mode
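One possible form of such an extraction script is sketched below. It assumes the tag layout shown above, {[word]=HYPONYM} and {[word]=HYPERNYM}, and pairs each hyponym with the next hypernym on the same line. The function name extract_pairs and the English sample line are hypothetical.

```python
import re
from typing import List, Tuple

TAGGED = re.compile(r"\{\[([^\]]+)\]=(HYPONYM|HYPERNYM)\}")


def extract_pairs(line: str) -> List[Tuple[str, str]]:
    """Return (hyponym, hypernym) pairs found in one tagged line."""
    pairs = []
    hyponym = None
    for word, label in TAGGED.findall(line):
        if label == "HYPONYM":
            hyponym = word
        elif hyponym is not None:
            pairs.append((hyponym, word))
            hyponym = None  # each hyponym is paired at most once
    return pairs


# Hypothetical English line in the same tag format.
line = "{[Maple]=HYPONYM} is a genus of {[trees]=HYPERNYM} in the family"
print(extract_pairs(line))  # [('Maple', 'trees')]
```

This returns the surface forms as they occur in the text, not their lemmas, which is one reason to do the mining inside Unitex instead.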
Mining hyponyms and hypernyms
1. Open match-hyponyms.grf : FSGraph > Open...
2. Click on the first <N> box, right-click on it and select
Surround with > Morphological mode
3. Click on the first <N> box and change it to <N>/$hyponym$ to store
the matched noun with all morphological information in the
$hyponym$ variable
Mining hyponyms and hypernyms
4. Modify the second <N> box to store the matched noun in the variable
$hypernym$ in the morphological mode
5. Add <E>/$hypernym.LEMMA$: $hyponym.LEMMA$ before the
final state
6. Save this graph as mine-hyponyms.grf
7. In Info > Preferences... > Morphological dictionaries add
Dela/CISLEXru_igrok.bin
Mining hyponyms and hypernyms
1. Set this graph in Text > Locate pattern...
2. Select Replace recognized sequences in Grammar outputs
3. Click Search
млекопитающее: мамонт
дерево: бук
дерево: бука
дерево: Бук
Mining hyponyms and hypernyms
1. Why so many Бук outputs? Let's see in the dictionary: DELA >
Lookup... , select CISLEXru_igrok.bin and enter this word
Бук,.N+FAMN+PN+anim(o)+gen(M):neM
Бук,.N+FAMN+PN+anim(o)+gen(F):neF:geF:deF:aeF:teF:qeF:nm:gm:d
бук,бука.N+anim(o)+gen(F)+gen(M):gm:aom
бук,.N+anim(j)+gen(M):neM:ajeM
Mining hyponyms and hypernyms
2. Let's modify mine-hyponyms.grf to remove ambiguous outputs:
change the first <N> box to <N~PN:n>
2 outputs
млекопитающее: мамонт
дерево: бук
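The way a mask such as <N~PN:n> narrows the readings can be sketched as a filter over dictionary lines. The reading assumed here is: the entry must carry the code N, must not carry PN, and must have at least one inflectional code containing the character n. The function names and the ASCII sample entries are hypothetical.

```python
from typing import List, Set, Tuple


def parse(line: str) -> Tuple[str, str, Set[str], List[str]]:
    """Split 'form,lemma.CODE+CODE:infl:infl' into its parts."""
    form, rest = line.split(",", 1)
    lemma, info = rest.split(".", 1)
    codes, *inflections = info.split(":")
    return form, lemma or form, set(codes.split("+")), inflections


def matches(line: str, required: Set[str], forbidden: Set[str],
            inflection: str) -> bool:
    """Check one entry against required/forbidden codes and inflection."""
    _, _, codes, inflections = parse(line)
    if not required <= codes or forbidden & codes:
        return False
    if not inflection:
        return True
    # Assumed rule: some inflectional code contains every requested char.
    return any(set(inflection) <= set(code) for code in inflections)


# Hypothetical entries that mirror the structure of the lookup result.
entries = ["Beech,.N+PN:neM", "beech,beecha.N:gm:aom", "beech,.N:neM"]
kept = [e for e in entries if matches(e, {"N"}, {"PN"}, "n")]
print(kept)  # ['beech,.N:neM']
```

The proper-name reading is rejected by the forbidden code, and the reading without a nominative code is rejected by the inflectional constraint, so a single entry remains.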
References
1. Jurafsky, D., & Martin, J. H. (2000). Speech and Language Processing: An
Introduction to Natural Language Processing, Computational Linguistics,
and Speech Recognition.
2. Miller, G. A., Beckwith, R., Fellbaum, C., Gross, D., & Miller, K. J. (1990).
Introduction to WordNet: An On-line Lexical Database. International
Journal of Lexicography, 3(4), 235-244.
3. Paumier, S. (2015). Unitex 3.1.beta User Manual. Université Paris-Est
Marne-la-Vallée. January 15, 2015,
http://igm.univ-mlv.fr/~unitex/UnitexManual3.1.pdf
4. Sinclair, J. (2005). "Corpus and Text - Basic Principles" in Developing
Linguistic Corpora: a Guide to Good Practice, ed. M. Wynne. Oxford: Oxbow
Books: 1-16. Available online from
http://ahds.ac.uk/linguistic-corpora/ [Accessed 2015-04-01].
Text Processing in Unitex
 PatternSim (github.com/cental/PatternSim): a tool for calculating
semantic similarity between words from a text corpus based on
lexico-syntactic patterns
 Normatex (github.com/avlukanin/normatex): Russian text normalization
for speech synthesis, machine translation and other natural language
processing tasks
 Unitext Tutorial (github.com/avlukanin/unitextutorial): the slides and
source files used in this tutorial
Text Processing with Finite State
Transducers in Unitex
Artem Lukanin
 about.me/alukanin
 @avlukanin
 artyom.lukanin@gmail.com
Slides: artyom.ice-lc.com/slides/unitextutorial
