ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS
April 9-11, 2015, Yekaterinburg
Normalization of Non-Standard Words
with Finite State Transducers
for Russian Speech Synthesis
Artem Lukanin
Text Preprocessing for Speech Synthesis
 is usually a very complex task
 Text normalization is one of the steps in text preprocessing [1]
 sentence segmentation
 tokenization
 normalization of non-standard words (NSWs)
 numbers, abbreviations, and acronyms
 special characters such as %, $, #, etc.
2
Normalization of Non-Standard Words
 NSWs must be expanded into full standard words (SWs) to be pronounced correctly
 It is even more complex in inflected languages such as Russian
 an ordinal number can be converted into 36 different word forms (6
cases × 2 numbers × 3 genders)
 digit position changes the output standard word (the digits being read
are bracketed)
 1111[1] → первый "first"
 111[11] → одиннадцатый "eleventh"
 11[1]11 → сто "one hundred"
 1[1]111 → тысяча "one thousand"
 [11]111 → одиннадцать "eleven"

3
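The figure of 36 forms is the product of the three grammatical categories named on the slide. A minimal Python sketch that enumerates the combinations is given here; the category labels and the function name are illustrative choices, not identifiers from Normatex.

```python
from itertools import product

# Grammatical categories as counted on the slide (labels are illustrative).
CASES = ["nominative", "genitive", "dative",
         "accusative", "instrumental", "prepositional"]
NUMBERS = ["singular", "plural"]
GENDERS = ["masculine", "feminine", "neuter"]


def ordinal_form_slots():
    """Return every (case, number, gender) combination an ordinal may need."""
    return list(product(CASES, NUMBERS, GENDERS))


slots = ordinal_form_slots()
print(len(slots))  # 6 * 2 * 3 combinations
```

Each tuple is one slot a normalizer has to be able to fill for every ordinal it expands.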
Existing Russian Normalization Systems
 As part of proprietary text-to-speech (TTS) systems
 Google Translate, https://translate.google.ru/
 VitalVoice, http://cards.voicefabric.ru/
 Windows SAPI voices, etc.
 As part of open-source TTS systems
 Festival [2]
 only digit-by-digit number normalization for the Russian voice
4
Normatex
 is the first open-source Russian normalization system known to the
author, github.com/avlukanin/normatex
 If input texts are normalized beforehand, the quality of the speech
synthesized by existing TTS systems can be improved
 118 finite state transducers (FSTs) convert cardinal and ordinal
numbers into the corresponding numerals and can process
ranges, times, dates, telephone numbers, postal codes, etc.
 33 FSTs normalize graphic abbreviations and acronyms
5
Test Parallel Corpus
 66 original texts from the official site of South Ural State University,
susu.ac.ru, containing 38,439 tokens (broad segmentation units [3]):
 14,661 word tokens
 333 acronyms and 98 initials; 379 graphic abbreviations
 977 number tokens (2,511 digits)
 66 manually preprocessed texts, in which all numbers, abbreviations and
acronyms were expanded into full words or replaced with pronounceable
combinations of letters
6
Finite State Transducers
 are developed as graphs in Unitex 3.1beta
 Before FSTs are applied, the text is preprocessed:
 The text is split into sentences
 The text is tokenized
 Every token is assigned all of its possible grammatical forms
 Number FSTs are applied first to handle numbers and measure-unit
abbreviations
 Abbreviation FSTs and acronym FSTs are then applied in sequence
7
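The first two preprocessing steps can be sketched in plain Python. This is only an approximation with regular expressions, not the Unitex preprocessing itself; the function names and the sample text are invented for illustration. The third step, assigning grammatical forms, needs a morphological dictionary and is not shown.

```python
import re


def split_sentences(text):
    """Split after '.', '!' or '?' when whitespace and a capital letter follow."""
    return re.split(r"(?<=[.!?])\s+(?=[A-ZА-ЯЁ])", text.strip())


def tokenize(sentence):
    """Return digit runs, letter runs and single punctuation marks as tokens."""
    return re.findall(r"\d+|[^\W\d_]+|[^\w\s]", sentence)


def token_type(token):
    """Coarse token class that decides which kind of rule may apply."""
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    return "other"


text = "Создано 118 графов. Тест включает 66 текстов!"
for sentence in split_sentences(text):
    print([(tok, token_type(tok)) for tok in tokenize(sentence)])
```

A splitter this naive would also break after an abbreviation such as "г." when a capitalised word follows, which is one reason abbreviations need context-aware rules.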
Cardinal Numbers
 agree with nouns in case; the numerals один "one" and два "two"
agree in gender as well
 all the constituent words of a compound numeral agree with the
corresponding noun: двадцати одного and двадцати одной ("twenty-one"
in genitive masculine and feminine)
 одни ("one" in the plural) agrees only with pluralia tantum, e.g. одни
ножницы "one pair of scissors", одни брюки "one pair of pants" [4]
8
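The agreement rules above can be illustrated with a small lookup table. The sketch covers only 21 and 22 in the nominative and genitive; the table layout and the function name are assumptions made for this example, not the way Normatex encodes the rules in its graphs.

```python
# Forms of "one" and "two" by case and gender (illustrative subset).
UNITS = {
    (1, "nom"): {"m": "один", "f": "одна", "n": "одно"},
    (1, "gen"): {"m": "одного", "f": "одной", "n": "одного"},
    (2, "nom"): {"m": "два", "f": "две", "n": "два"},
    (2, "gen"): {"m": "двух", "f": "двух", "n": "двух"},
}
# "Twenty" agrees in case only.
TENS = {(20, "nom"): "двадцать", (20, "gen"): "двадцати"}


def cardinal(n, case, gender):
    """Inflect 21 or 22: every constituent takes the case, the unit also the gender."""
    tens, unit = divmod(n, 10)
    return f"{TENS[(tens * 10, case)]} {UNITS[(unit, case)][gender]}"


print(cardinal(21, "gen", "m"))
print(cardinal(21, "gen", "f"))
```

Both constituents change with the case, while only the unit reacts to the gender, which matches the two genitive examples on the slide.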
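[Figure: Unitex graph 5-9ncard, mapping digits to cardinal numerals: 5 → пять, 6 → шесть, 7 → семь, 8 → восемь, 9 → девять]
9
[Figure: Unitex graph 2x-9xncard, mapping tens digits to cardinal numerals: 2 → двадцать, 3 → тридцать, 4 → сорок, 5 → пятьдесят, 6 → шестьдесят, 7 → семьдесят, 8 → восемьдесят, 9 → девяносто]
10
[Figure: Unitex graph NUM-5-9-ncard; box labels: 5-9ncard, 2x-9xncard, 10-19ncard, NUMxx-ncard, 0, пробел "space"]
11
[Figure: Unitex graph; box labels: units, NUM-1-ncard, NUM-2-ncard, NUM-3-4-ncard, NUM-5-9-ncard, units-1, units-2-4, <N:g>, <A:g>, из "of", -2-9-gcard, -1m-gcard]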
12
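The digit-to-word mappings in the graphs on the preceding slides can be mimicked with dictionaries. The following Python sketch produces nominative cardinals for 1 to 99 only; it is a simplified stand-in for the graphs, and the table and function names are assumptions of this example.

```python
# Nominative cardinal numerals (masculine forms for 1 and 2).
UNITS = {1: "один", 2: "два", 3: "три", 4: "четыре", 5: "пять",
         6: "шесть", 7: "семь", 8: "восемь", 9: "девять"}
TEENS = {10: "десять", 11: "одиннадцать", 12: "двенадцать",
         13: "тринадцать", 14: "четырнадцать", 15: "пятнадцать",
         16: "шестнадцать", 17: "семнадцать", 18: "восемнадцать",
         19: "девятнадцать"}
TENS = {2: "двадцать", 3: "тридцать", 4: "сорок", 5: "пятьдесят",
        6: "шестьдесят", 7: "семьдесят", 8: "восемьдесят", 9: "девяносто"}


def cardinal_nominative(n):
    """Spell out 1..99 as a nominative cardinal numeral."""
    if not 1 <= n <= 99:
        raise ValueError("this sketch only covers 1..99")
    if n < 10:
        return UNITS[n]
    if n < 20:
        return TEENS[n]
    tens, unit = divmod(n, 10)
    words = [TENS[tens]]
    if unit:
        words.append(UNITS[unit])
    return " ".join(words)


print(cardinal_nominative(99))
```

The three dictionaries correspond to the three ranges the graph names suggest (units, 10 to 19, and tens), which is why the teens are looked up as whole words rather than composed.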
Ordinal Numbers
 Simple ordinal numerals agree with nouns in gender, case and number
 In compound ordinal numerals only the last constituent word agrees with
the noun [5]: две тысячи четырнадцатом ("two thousand fourteenth" in
prepositional masculine)
 Complex ordinal numbers ending in -00, -000, -000000, -000000000 are
written without spaces: 153000 is converted into
стопятидесятитрёхтысячный "one hundred and fifty-three thousandth"
in nominative masculine
13
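The rule for compound ordinals (only the last word is inflected as an ordinal) can be sketched as follows. The table holds just a few prepositional masculine forms and, like the function name, is an assumption made for illustration.

```python
# Prepositional masculine singular ordinals for a few final words
# (illustrative subset, keyed by the nominative cardinal).
ORDINAL_PREP_MASC = {
    "один": "первом",
    "четырнадцать": "четырнадцатом",
    "двадцать": "двадцатом",
}


def compound_ordinal(cardinal_words, ordinal_table):
    """Keep all words but the last as cardinals; turn the last into an ordinal."""
    *head, last = cardinal_words
    return " ".join(head + [ordinal_table[last]])


# 2014 read as an ordinal in the prepositional masculine.
print(compound_ordinal(["две", "тысячи", "четырнадцать"], ORDINAL_PREP_MASC))
```

A full system would need one such table per case, gender and number, which is where the 36 forms per ordinal come from.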
Ordinal Numbers
 Only the last constituent words -сотый "hundredth", -тысячный
"thousandth", -миллионный "millionth", -миллиардный "billionth"
agree with the nouns
 The words preceding the last word are used in the genitive plural (the
exceptions are сто "one hundred" and девяносто "ninety", which are
used in the nominative case) [6]
14
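The construction of complex ordinals can be sketched in a few lines of Python. The genitive table is a small hand-picked subset and the function name is hypothetical; numerals with a final "one" (for example 21000) follow a different pattern and are deliberately left out.

```python
# Genitive forms of a few cardinals (illustrative subset).
GENITIVE = {
    "два": "двух",
    "три": "трёх",
    "пять": "пяти",
    "двадцать": "двадцати",
    "пятьдесят": "пятидесяти",
}
# Words that stay in the nominative inside a complex ordinal.
NOMINATIVE_EXCEPTIONS = {"сто", "девяносто"}


def complex_ordinal(cardinal_words, last_constituent):
    """Join the preceding words without spaces and append the ordinal ending."""
    parts = [
        word if word in NOMINATIVE_EXCEPTIONS else GENITIVE[word]
        for word in cardinal_words
    ]
    return "".join(parts) + last_constituent


# 153000 as an ordinal in the nominative masculine.
print(complex_ordinal(["сто", "пятьдесят", "три"], "тысячный"))
```

Only the last constituent would then be inflected for case, gender and number; the joined prefix stays fixed.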
Acronyms
 Most acronyms should be expanded into full words before speech
synthesis: a letter-by-letter pronunciation is difficult for listeners to
comprehend, and many acronyms are too rare for everyone to know
which phrase they stand for
ФГБОУ ВПО «ЮУрГУ» (НИУ) → Федеральное государственное
бюджетное образовательное учреждение высшего
профессионального образования «Южно-Уральский государственный
университет» (Научно-исследовательский университет)
("Federal State Budgetary Educational Institution of Higher Professional
Education 'South Ural State University' (Research University)")
ФГБОУ ВПО «ЮУрГУ» (НИУ)
15
Acronyms
 The main component of an acronym is a noun, so the expanded phrase
can take 12 possible forms in Russian (six cases × two numbers)
 There are rules for all six cases in Normatex
 Acronyms can be ambiguous in different corpora
 For all ambiguous or unknown acronyms Normatex substitutes each
letter with its alphabet name:   
16
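The fallback for ambiguous or unknown acronyms can be sketched as a dictionary lookup followed by letter-by-letter spelling. The letter names are the standard names of the Russian alphabet; the exact spellings Normatex outputs, the one-entry dictionary, and the acronym "АБВ" are assumptions of this example. Only the nominative expansion is shown.

```python
# Standard names of the Russian letters.
LETTER_NAMES = {
    "А": "а", "Б": "бэ", "В": "вэ", "Г": "гэ", "Д": "дэ", "Е": "е",
    "Ё": "ё", "Ж": "жэ", "З": "зэ", "И": "и", "Й": "и краткое",
    "К": "ка", "Л": "эль", "М": "эм", "Н": "эн", "О": "о", "П": "пэ",
    "Р": "эр", "С": "эс", "Т": "тэ", "У": "у", "Ф": "эф", "Х": "ха",
    "Ц": "цэ", "Ч": "че", "Ш": "ша", "Щ": "ща", "Ъ": "твёрдый знак",
    "Ы": "ы", "Ь": "мягкий знак", "Э": "э", "Ю": "ю", "Я": "я",
}
# Hypothetical dictionary of unambiguous acronyms (nominative only).
KNOWN_ACRONYMS = {"ЮУрГУ": "Южно-Уральский государственный университет"}


def expand_acronym(acronym, known=KNOWN_ACRONYMS):
    """Expand a known acronym; otherwise spell it out letter by letter."""
    if acronym in known:
        return known[acronym]
    return " ".join(LETTER_NAMES.get(ch, ch) for ch in acronym.upper())


print(expand_acronym("ЮУрГУ"))
print(expand_acronym("АБВ"))
```

Spelling out is the safe default: it never produces a wrong expansion, at the cost of a less natural reading.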
Graphic Abbreviations
 Single interpretation: и т.д. "etc." → и так далее, т.е. "i.e." → то есть
 The interpretation depends on the context: и др. "et al." → и другие
"and others", и других "and others", и другим "and others", и другое
"and other"
 Ambiguous: г. → год "year", город "city", грамм "gram" (every noun
can have 12 word forms); Аудитории: 339-г, 339-д "Rooms 339-g, 339-d"
 FSTs must be given sufficient left and right context and must be
applied in a definite order
17
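The role of context and rule order can be illustrated with an ordered list of regular-expression rules. This is a rough stand-in for the FST cascade, not the Normatex graphs: the patterns are invented for this sketch, and the output is always in the nominative, whereas a real rule would also choose the case.

```python
import re

# Rules are tried in order; earlier rules take precedence.
RULES = [
    (re.compile(r"\bи т\.д\."), "и так далее"),      # single interpretation
    (re.compile(r"\bт\.е\."), "то есть"),            # single interpretation
    (re.compile(r"(?<=\d{4} )г\."), "год"),          # left context: 4 digits
    (re.compile(r"\bг\.(?= [А-ЯЁ])"), "город"),      # right context: capital
]


def normalize(text):
    """Apply every rule to the whole text, in the listed order."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(normalize("2015 г. Челябинск"))
print(normalize("г. Челябинск"))
```

In the first call both "г." rules could match; because the year rule runs first, the abbreviation is read as "год". With the order reversed it would be read as "город".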
Results
Token type              Tokens  Correct  Errors  Recall   Precision
Numbers                    977      920      53  94.17%   94.55%
Acronyms and initials      431      355      40  82.37%   89.87%
Graphic abbreviations      379      232       4  61.21%   98.05%
Total                     1787     1507      97  84.33%   93.95%
The work is still in progress
18
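The slide does not state how recall and precision were computed. The sketch below assumes recall = Correct / Tokens and precision = Correct / (Correct + Errors). Under that assumption it reproduces all four reported recall values and three of the four precision values; for graphic abbreviations it gives 98.31% rather than the reported 98.05%, so that row may rest on slightly different counts.

```python
# (tokens, correct, errors) per token type, copied from the results table.
RESULTS = {
    "Numbers": (977, 920, 53),
    "Acronyms and initials": (431, 355, 40),
    "Graphic abbreviations": (379, 232, 4),
    "Total": (1787, 1507, 97),
}


def recall(tokens, correct):
    """Share of all tokens of a type that were normalized correctly (assumed)."""
    return 100 * correct / tokens


def precision(correct, errors):
    """Share of attempted normalizations that were correct (assumed)."""
    return 100 * correct / (correct + errors)


for name, (tokens, correct, errors) in RESULTS.items():
    print(f"{name}: recall {recall(tokens, correct):.2f}%, "
          f"precision {precision(correct, errors):.2f}%")
```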
References
1. Reichel, U.D., Pfitzinger, H.R.: Text preprocessing for speech synthesis
(2006)
2. The Festival Speech Synthesis System,
http://www.cstr.ed.ac.uk/projects/festival/
3. Dutoit, T.: An introduction to text-to-speech synthesis (Vol. 3). Springer
Science & Business Media (1997)
4. Russian Grammar [Русская грамматика]. Vol. 1. Nauka, Moscow (1980)
19
References
5. Rosental, D.E., Golub, I.B., Telenkova, M.A.: The Modern Russian Language
[Современный русский язык]. Airis-Press, Moscow (1997)
6. Rosental, D.E., Djandjakova, E.V., Kabanova, N.P.: Reference Book on
Orthography, Pronunciation, Literary Editing [Справочник по
правописанию, произношению, литературному редактированию].
CheRo, Moscow (1998)
20
Normatex: Russian text normalization
github.com/avlukanin/normatex
Artem Lukanin
 about.me/alukanin
 @avlukanin
 artyom.lukanin@gmail.com
Slides: artyom.ice-lc.com/slides/normatex
21
