This document discusses text normalization for Russian speech synthesis. It introduces Normatex, an open-source Russian text normalization system using finite state transducers. Normatex expands non-standard words like numbers, abbreviations, and acronyms. It achieved 84.33% recall and 93.95% precision on a test corpus. The document outlines challenges in Russian normalization like inflection and ambiguity, and describes how Normatex handles cardinal and ordinal numbers, acronyms, and abbreviations.
Normalization of Non-Standard Words with Finite State Transducers for Russian Speech Synthesis
1. ANALYSIS OF IMAGES, SOCIAL NETWORKS, AND TEXTS
April 9-11, 2015, Yekaterinburg
Normalization of Non-Standard Words
with Finite State Transducers
for Russian Speech Synthesis
Artem Lukanin
2. Text Preprocessing for Speech Synthesis
is usually a very complex task
Text normalization is one of the steps in text preprocessing [1]
sentence segmentation
tokenization
normalization of non-standard words (NSWs)
numbers, abbreviations, and acronyms
other characters such as %, $, #, etc.
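The three preprocessing steps listed above can be sketched in a few lines of Python. This is only an illustration with naive regular expressions; the function names and the small symbol set are assumptions made for the example and are not taken from any of the systems discussed here.

```python
import re

def split_sentences(text):
    # Naive rule: break after ., ! or ? when followed by whitespace
    # and an uppercase Latin or Cyrillic letter.
    return re.split(r"(?<=[.!?])\s+(?=[A-ZА-ЯЁ])", text.strip())

def tokenize(sentence):
    # A token is a run of letters, a run of digits, or one other symbol.
    return re.findall(r"[^\W\d_]+|\d+|\S", sentence)

def is_nsw(token):
    # Illustrative test only: digit strings and a few special characters.
    return token.isdigit() or token in {"%", "$", "#"}

sentences = split_sentences("It costs 5 $. Growth was 10 %.")
nsws = [t for t in tokenize(sentences[1]) if is_nsw(t)]
```

A splitter this simple breaks on abbreviations that end in a period, which is one reason normalization of non-standard words is treated as a separate step.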
3. Normalization of Non-Standard Words
NSWs must be expanded into full standard words (SWs) to be pronounced correctly
This is even more complex in inflected languages such as Russian
an ordinal number can be converted into 36 different word forms (6
cases × 2 numbers × 3 genders)
digit position changes the output standard word
1111 1 → первый 'first'
111 11 → одиннадцатый 'eleventh'
11 1 11 → сто 'one hundred'
1 1 111 → тысяча 'one thousand'
11 111 → одиннадцать 'eleven'
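The count of 36 forms quoted above is the product of the three grammatical categories. A minimal sketch that enumerates the combinations (the English labels are chosen for this example):

```python
from itertools import product

CASES = ["nominative", "genitive", "dative",
         "accusative", "instrumental", "prepositional"]
NUMBERS = ["singular", "plural"]
GENDERS = ["masculine", "feminine", "neuter"]

def ordinal_feature_combinations():
    # Every (case, number, gender) slot an ordinal numeral may have to fill.
    return list(product(CASES, NUMBERS, GENDERS))

combinations = ordinal_feature_combinations()
```

Several of these slots share the same surface form in practice, but a normalizer still has to decide which slot applies in order to pick the right form.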
4. Existing Russian Normalization Systems
As a part of proprietary Text-to-Speech (TTS) systems
Google Translate, https://translate.google.ru/
VitalVoice, http://cards.voicefabric.ru/
Windows SAPI voices, etc.
As a part of open-source TTS systems
Festival [2]
only digit-by-digit number normalization for the Russian voice
5. Normatex
is the first open-source Russian normalization system known to the
author, github.com/avlukanin/normatex
If the input texts are normalized beforehand, the quality of the
synthesized speech of existing TTS systems can be improved
118 finite state transducers (FSTs) for conversion of cardinal and ordinal
numbers into the corresponding numerals, which can preprocess
different ranges, time, dates, telephone numbers, postal codes, etc.
33 FSTs for normalization of graphic abbreviations and acronyms
6. Test Parallel Corpus
66 original texts from the official site of South Ural State University,
susu.ac.ru, which contain 38,439 tokens (broad segmentation units [3]):
14,661 word tokens
333 acronyms and 98 initials; 379 graphic abbreviations
977 number tokens (2,511 digits)
66 manually preprocessed texts, where all numbers, abbreviations and
acronyms were expanded into full words or replaced with pronounceable
combinations of letters
7. Finite State Transducers
are developed in the form of graphs in Unitex 3.1beta
Before applying FSTs to a text, it is preprocessed:
The text is split into sentences
The text is tokenized
Every token is assigned all possible grammatical forms
Number FSTs are applied first to deal with numbers and measurement-unit
abbreviations
Abbreviation FSTs and acronym FSTs are applied sequentially after that
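The ordering described above (numbers first, then abbreviations, then acronyms) can be illustrated as a cascade of rewrite stages. The sketch below uses Python regular expressions as a stand-in for the Unitex graphs; the three toy rules and the function name are assumptions made for this example and are not the actual Normatex transducers.

```python
import re

# Each stage is a list of (pattern, replacement) pairs; all rules are toy examples.
NUMBER_RULES = [(r"\b2 кг\b", "два килограмма")]
ABBREVIATION_RULES = [(r"\bи т\.\s?д\.", "и так далее")]
ACRONYM_RULES = [(r"\bЮУрГУ\b", "Южно-Уральский государственный университет")]

STAGES = [NUMBER_RULES, ABBREVIATION_RULES, ACRONYM_RULES]

def apply_cascade(text, stages=STAGES):
    # Apply the stages strictly in order; later stages see the output
    # of earlier ones, as in a sequential FST cascade.
    for rules in stages:
        for pattern, replacement in rules:
            text = re.sub(pattern, replacement, text)
    return text

result = apply_cascade("ЮУрГУ закупил 2 кг бумаги и т.д.")
```

Running the number stage first means that unit abbreviations next to digits are already expanded before the general abbreviation rules see the text.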
8. Cardinal Numbers
agree with nouns in case, but the numerals один 'one' and два 'two'
agree in gender as well
all the constituent words of a compound numeral agree with the
corresponding noun: двадцати одного and двадцати одной ('twenty-one'
in genitive masculine and feminine)
одни ('one' in plural) agrees only with pluralia tantum, e.g. одни
ножницы 'one pair of scissors', одни брюки 'one pair of pants' [4]
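The agreement rules for compound cardinals can be shown with a small lookup table. This is a sketch for the single numeral "twenty-one"; the tables and the function name are assumptions made for the example, and the accusative is left out because its form also depends on animacy.

```python
# Case forms of "двадцать" (twenty); it has no gender distinction.
TWENTY = {"nom": "двадцать", "gen": "двадцати", "dat": "двадцати",
          "ins": "двадцатью", "prep": "двадцати"}

# Case forms of "один" (one) by gender: m = masculine, f = feminine, n = neuter.
ONE = {
    "m": {"nom": "один", "gen": "одного", "dat": "одному",
          "ins": "одним", "prep": "одном"},
    "f": {"nom": "одна", "gen": "одной", "dat": "одной",
          "ins": "одной", "prep": "одной"},
    "n": {"nom": "одно", "gen": "одного", "dat": "одному",
          "ins": "одним", "prep": "одном"},
}

def twenty_one(case, gender):
    # Both constituent words take the case of the noun;
    # only "один" also takes its gender.
    return f"{TWENTY[case]} {ONE[gender][case]}"
```

For example, `twenty_one("gen", "m")` and `twenty_one("gen", "f")` give the two genitive forms quoted above.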
13. Ordinal Numbers
Simple ordinal numerals agree with nouns in gender, case and number
In compound ordinal numerals only the last constituent word agrees with
the noun [5]: две тысячи четырнадцатом ('two thousand fourteenth' in
prepositional masculine)
Complex ordinal numbers ending in -00, -000, -000000, -000000000 are
written without spaces: 153000 is converted into
стопятидесятитрёхтысячный 'one hundred and fifty-three thousandth'
in nominative masculine
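The rule for compound ordinals, where only the last word is declined, can be sketched as follows. The function name and the ending table are assumptions made for this example; it covers only masculine singular forms of ordinals whose nominative ends in -ый and leaves the preceding cardinal words untouched.

```python
# Masculine singular endings of hard-stem ordinals (accusative omitted,
# since it depends on animacy).
MASC_ENDINGS = {"nom": "ый", "gen": "ого", "dat": "ому",
                "ins": "ым", "prep": "ом"}

def decline_compound_ordinal(cardinal_words, last_ordinal_nom, case):
    # Only the final ordinal word changes; the cardinal words stay as given.
    if not last_ordinal_nom.endswith("ый"):
        raise ValueError("this sketch only handles ordinals ending in -ый")
    stem = last_ordinal_nom[:-2]
    return " ".join(cardinal_words + [stem + MASC_ENDINGS[case]])

year = decline_compound_ordinal(["две", "тысячи"], "четырнадцатый", "prep")
```

Here `year` is the prepositional masculine form of "two thousand fourteenth" discussed above.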
14. Ordinal Numbers
Only the last constituent words -сотый 'hundredth', -тысячный
'thousandth', -миллионный 'millionth', -миллиардный 'billionth'
agree with the nouns
The words preceding the last word are used in genitive plural (the
exceptions are сто 'one hundred' and девяносто 'ninety', which are
used in the nominative case) [6]
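As an illustration of this rule, the sketch below builds the nominative masculine form of a "thousandth" ordinal by concatenating genitive forms of the leading words, with сто and девяносто kept in the nominative. The tables and the function name are assumptions made for the example, the range is limited to multipliers from 2 to 999, and this is not the Normatex implementation.

```python
HUNDREDS = {1: "сто", 2: "двухсот", 3: "трёхсот", 4: "четырёхсот",
            5: "пятисот", 6: "шестисот", 7: "семисот",
            8: "восьмисот", 9: "девятисот"}
TENS = {2: "двадцати", 3: "тридцати", 4: "сорока", 5: "пятидесяти",
        6: "шестидесяти", 7: "семидесяти", 8: "восьмидесяти",
        9: "девяносто"}
TEENS = {10: "десяти", 11: "одиннадцати", 12: "двенадцати",
         13: "тринадцати", 14: "четырнадцати", 15: "пятнадцати",
         16: "шестнадцати", 17: "семнадцати", 18: "восемнадцати",
         19: "девятнадцати"}
UNITS = {1: "одно", 2: "двух", 3: "трёх", 4: "четырёх", 5: "пяти",
         6: "шести", 7: "семи", 8: "восьми", 9: "девяти"}

def thousandth(multiplier):
    # Ordinal for multiplier * 1000, written as one word.
    if not 2 <= multiplier <= 999:
        raise ValueError("this sketch covers multipliers from 2 to 999")
    hundreds, rest = divmod(multiplier, 100)
    parts = []
    if hundreds:
        parts.append(HUNDREDS[hundreds])
    if 10 <= rest <= 19:
        parts.append(TEENS[rest])
    else:
        tens, units = divmod(rest, 10)
        if tens:
            parts.append(TENS[tens])
        if units:
            parts.append(UNITS[units])
    return "".join(parts) + "тысячный"
```

Calling `thousandth(153)` reproduces the 153000 example from the previous slide.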
15. Acronyms
Most acronyms should be converted into full words before speech
synthesis, because a letter-by-letter pronunciation is difficult for
listeners to comprehend and because many acronyms are too rare for
everyone to know which phrase they correspond to
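ФГБОУ ВПО «ЮУрГУ» (НИУ)
Федеральное государственное бюджетное образовательное учреждение высшего
профессионального образования «Южно-Уральский государственный
университет» (Научно-исследовательский университет)
'Federal State Budgetary Educational Institution of Higher Professional
Education "South Ural State University" (Research University)'
ФГБОУ ВПО «ЮУрГУ» (НИУ)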
16. Acronyms
The main component of an acronym is a noun, which is why the converted
phrase can have 12 possible forms (six cases and two numbers) in
Russian
There are rules for all six cases in Normatex
Acronyms can be ambiguous in different corpora
For all ambiguous or unknown acronyms, Normatex replaces each
letter with its alphabet name
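The fallback of spelling an acronym letter by letter can be sketched as a simple lookup. The table uses the conventional names of the Russian letters, and the function name is an assumption made for this example; the exact spellings that Normatex produces may differ.

```python
# Conventional names of the Russian letters (an assumption for this sketch).
LETTER_NAMES = {
    "А": "а", "Б": "бэ", "В": "вэ", "Г": "гэ", "Д": "дэ", "Е": "е",
    "Ё": "ё", "Ж": "жэ", "З": "зэ", "И": "и", "Й": "и краткое",
    "К": "ка", "Л": "эль", "М": "эм", "Н": "эн", "О": "о", "П": "пэ",
    "Р": "эр", "С": "эс", "Т": "тэ", "У": "у", "Ф": "эф", "Х": "ха",
    "Ц": "цэ", "Ч": "че", "Ш": "ша", "Щ": "ща", "Ъ": "твёрдый знак",
    "Ы": "ы", "Ь": "мягкий знак", "Э": "э", "Ю": "ю", "Я": "я",
}

def spell_out(acronym):
    # Replace every letter with its name; unknown characters are kept as-is.
    return " ".join(LETTER_NAMES.get(ch.upper(), ch) for ch in acronym)
```

For example, `spell_out("НИУ")` yields the three letter names separated by spaces, which a TTS engine can then read as ordinary words.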
17. Graphic Abbreviations
Single interpretation: и т.д. 'etc.' → и так далее, т.е. 'i.e.' → то есть
The interpretation depends on the context: и др. 'et al.' → и другие
'and others', и других 'and others', и другим 'and others', и другое
'and other'
Ambiguous: г. → год 'year', город 'city', грамм 'gram' (every noun
can have 12 word forms), аудитория: 339-г, 339-д 'Room 339-g, 339-d'
FSTs need sufficient left and right context and should be applied in a
definite order
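The role of context and rule order for an ambiguous abbreviation such as г. can be illustrated with ordered regular-expression rules. The three rules below are assumptions made for this example, they always output the dictionary form and ignore case inflection, and they are far simpler than the contexts a real FST would need.

```python
import re

# Ordered rules: the most specific context comes first.
G_RULES = [
    (r"(?<=\d{4}) г\.", " год"),        # four-digit number on the left: year
    (r"\bг\. (?=[А-ЯЁ])", "город "),    # capitalized word on the right: city
    (r"(?<=\d) г\.", " грамм"),         # any other number on the left: gram
]

def expand_g(text):
    # Apply the rules in order; an earlier match removes the abbreviation,
    # so later, more general rules cannot fire on the same spot.
    for pattern, replacement in G_RULES:
        text = re.sub(pattern, replacement, text)
    return text
```

If the gram rule were applied first, it would also capture "2015 г.", which shows why the order of application matters.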
18. Results
Token type              Tokens  Correct  Errors  Recall  Precision
Numbers                    977      920      53  94.17%     94.55%
Acronyms and initials      431      355      40  82.37%     89.87%
Graphic abbreviations      379      232       4  61.21%     98.05%
Total                     1787     1507      97  84.33%     93.95%
The work is still in progress
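The slides do not define how recall and precision were computed. One reading that matches most of the table is recall = correct / tokens and precision = correct / (correct + errors); the sketch below assumes these definitions. They reproduce the reported figures for the Numbers, Acronyms and initials, and Total rows, while for graphic abbreviations they give a precision of 98.31% rather than the reported 98.05%, so that row may have been counted differently.

```python
# (tokens, correct, errors) per row, copied from the results table.
ROWS = {
    "Numbers": (977, 920, 53),
    "Acronyms and initials": (431, 355, 40),
    "Graphic abbreviations": (379, 232, 4),
    "Total": (1787, 1507, 97),
}

def recall_pct(tokens, correct):
    # Assumed definition: share of all NSW tokens expanded correctly.
    return round(100 * correct / tokens, 2)

def precision_pct(correct, errors):
    # Assumed definition: share of attempted expansions that are correct.
    return round(100 * correct / (correct + errors), 2)

metrics = {name: (recall_pct(t, c), precision_pct(c, e))
           for name, (t, c, e) in ROWS.items()}
```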
19. References
1. Reichel, U.D., Pfitzinger, H.R.: Text preprocessing for speech synthesis
(2006)
2. The Festival Speech Synthesis System,
http://www.cstr.ed.ac.uk/projects/festival/
3. Dutoit, T.: An introduction to text-to-speech synthesis (Vol. 3). Springer
Science & Business Media (1997)
4. Russian Grammar [Русская грамматика]. Vol. 1. Nauka, Moscow (1980)
20. References
5. Rosental, D.E., Golub, I.B., Telenkova, M.A.: The Modern Russian Language
[Современный русский язык]. Airis-Press, Moscow (1997)
6. Rosental, D.E., Djandjakova, E.V., Kabanova, N.P.: Reference Book on
Orthography, Pronunciation, Literary Editing [Справочник по
правописанию, произношению, литературному редактированию].
CheRo, Moscow (1998)