On the Utility of Moses for Sinhala-Tamil Translation
Yashothara S., Dr. R. T. Uthayasanker
National Language Processing center
Outline
 Background: Statistical Machine Translation (SMT)
 Introduction to Moses
 Training
 Decoder
Machine Translation
 Process of translating from one language into
another language using a computer
 Types of machine translation
 Rule based
 Example based
 Knowledge based
 Statistical based
 Hybrid model based
 Neural network based
[Figure: Source → Computer → Target]
Statistical Machine Translation
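Hmmm. Every time she sees [Sinhala word], she types either
[Tamil word 1] or [Tamil word 2],
but if she sees [Sinhala two-word phrase] she always types [Tamil word 1].
[Figure: pairs of source (S) and target (T) sentences, captioned "Translate, translate", forming the parallel corpus]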
Statistical Machine Translation
s: Sinhala (source)
t: Tamil (target)
P(t|s) ∝ P(s|t) × P(t)
where P(s|t) is the translation model (TM) and P(t) is the language model (LM)
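The decomposition above can be illustrated with a minimal sketch that picks the candidate translation t maximizing P(s|t) × P(t). The candidate names and all probabilities below are made-up toy values, not figures from the slides, and the function name is hypothetical.

```python
import math

def noisy_channel_best(candidates, tm_prob, lm_prob):
    """Return the candidate t that maximizes P(s|t) * P(t), scored in log space."""
    def score(t):
        return math.log(tm_prob[t]) + math.log(lm_prob[t])
    return max(candidates, key=score)

# Toy values (assumptions for illustration only).
tm = {"t_a": 0.66, "t_b": 0.22}  # translation model: P(s | t)
lm = {"t_a": 0.10, "t_b": 0.40}  # language model:    P(t)

best = noisy_channel_best(["t_a", "t_b"], tm, lm)
print(best)  # t_b, because 0.22 * 0.40 > 0.66 * 0.10
```

In this toy case the language model overrides the translation model's preference, which is the point of combining the two.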
Statistical Machine Translation
[Figure: a Sinhala sentence enters the decoder, which uses the translation model (TM) and the language model (LM) to produce the Tamil sentence]
Moses
 Open source SMT framework
 Language independent
 Plug and play
Steps
1. Preprocessing
2. Translation Model Building
3. Language Model Building
4. Decoding
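As a rough sketch of how the four steps map onto Moses tooling, the function below only builds the command lines as strings and runs nothing. The install location, corpus file names, and language codes are hypothetical; the script names and flags follow the Moses baseline documentation as far as I know and should be checked against the installed version. Whether the stock tokenizer handles the `si` and `ta` codes well is not something the slides confirm. The language model is built before `train-model.perl` here because that script records the LM path in the generated `moses.ini`.

```python
MOSES = "~/mosesdecoder"  # hypothetical install location

def pipeline_commands(corpus="corpus/train", src="si", tgt="ta", order=3,
                      lm_path="/abs/path/lm/train.arpa.ta"):
    """Return shell command strings for the four steps (nothing is executed)."""
    scripts = f"{MOSES}/scripts"
    return [
        # 1. Preprocessing: tokenize each side, then clean the sentence pairs
        #    (keep pairs whose sides are 1 to 80 tokens long).
        f"{scripts}/tokenizer/tokenizer.perl -l {src} < {corpus}.{src} > {corpus}.tok.{src}",
        f"{scripts}/tokenizer/tokenizer.perl -l {tgt} < {corpus}.{tgt} > {corpus}.tok.{tgt}",
        f"{scripts}/training/clean-corpus-n.perl {corpus}.tok {src} {tgt} {corpus}.clean 1 80",
        # 3. Language model on the target side (KenLM's lmplz).
        f"{MOSES}/bin/lmplz -o {order} < {corpus}.clean.{tgt} > {lm_path}",
        # 2. Translation model: word alignment, phrase extraction, scoring.
        f"{scripts}/training/train-model.perl -root-dir train -corpus {corpus}.clean "
        f"-f {src} -e {tgt} -alignment grow-diag-final-and "
        f"-lm 0:{order}:{lm_path}:8 -external-bin-dir {MOSES}/tools",
        # 4. Decoding with the generated configuration.
        f"{MOSES}/bin/moses -f train/model/moses.ini < input.{src} > output.{tgt}",
    ]

for cmd in pipeline_commands():
    print(cmd)
```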
Step 1: Preprocessing
 Tokenization: splitting sentences into tokens
 The tokenizer.perl script can be used.
Example:
Before tokenizing
[Sinhala sentence].
[Tamil sentence].
After tokenizing (the sentence-final full stop becomes a separate token)
[Sinhala sentence] .
[Tamil sentence] .
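The idea of tokenization can be sketched with a very small whitespace-and-punctuation tokenizer. This is not what tokenizer.perl does internally; the punctuation set and the function name are assumptions, and a real Sinhala or Tamil tokenizer would need language-specific rules.

```python
import re

# Punctuation marks that are split off as separate tokens (an assumed, minimal set).
_PUNCT = re.compile(r"([.,!?;:()\"])")

def simple_tokenize(sentence):
    """Put spaces around punctuation, then split on whitespace."""
    spaced = _PUNCT.sub(r" \1 ", sentence)
    return spaced.split()

tokens = simple_tokenize("Moses is open source.")
print(" ".join(tokens))  # Moses is open source .
```

A rule this crude also splits decimal numbers and abbreviations, which is one reason custom tokenization matters for new language pairs.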
Step 1: Preprocessing
 Cleaning: removing low-quality sentence pairs
 clean-corpus-n.perl can be used.
[Table: example Sinhala and Tamil sentence pairs]
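Cleaning can be sketched as a length-based filter over sentence pairs, in the spirit of clean-corpus-n.perl. The thresholds and the function name below are illustrative assumptions, not values taken from the slides.

```python
def clean_corpus(pairs, min_len=1, max_len=80, max_ratio=9.0):
    """Keep (source, target) pairs whose token counts look plausible."""
    kept = []
    for src, tgt in pairs:
        n_src, n_tgt = len(src.split()), len(tgt.split())
        # Drop pairs with an empty side.
        if min(n_src, n_tgt) == 0:
            continue
        # Drop pairs where either side is too short or too long.
        if not (min_len <= n_src <= max_len and min_len <= n_tgt <= max_len):
            continue
        # Drop pairs whose lengths are badly mismatched.
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept

pairs = [
    ("a b c", "x y z"),        # kept
    ("", "x"),                 # empty source side
    ("a", "x " * 20),          # length ratio 20:1
]
print(clean_corpus(pairs))  # [('a b c', 'x y z')]
```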
[Figure: the parallel corpus is used to build the translation model (TM); the decoder uses the TM and the language model (LM) to translate a Sinhala sentence into Tamil]
Step 2: Building the Translation Model
 Assigns a probability P(s|t) to each pair of source and target
words/phrases

Sinhala              Tamil              P(s|t)
[Sinhala word]       [Tamil word 1]     0.66
[Sinhala word]       [Tamil word 2]     0.22
[Sinhala phrase 1]   [Tamil phrase 1]   0.12
[Sinhala phrase 2]   [Tamil phrase 2]   0.22

E.g.
[Sinhala sentence 1]. [Tamil sentence 1].
[Sinhala sentence 2]. [Tamil sentence 2].
[Figure: S and T sentences → word alignment tool (GIZA++) → P(s|t)]
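A minimal sketch of how a phrase translation probability can be estimated by relative frequency, P(s|t) = count(s, t) / count(t), once phrase pairs have been extracted from the word alignments. Moses computes several scores per phrase pair; this shows only this one estimate, with placeholder tokens (`s1`, `t1`, ...) standing in for Sinhala and Tamil phrases and a hypothetical function name.

```python
from collections import Counter

def estimate_phrase_probs(phrase_pairs):
    """Relative-frequency estimate P(s|t) = count(s, t) / count(t)."""
    pair_counts = Counter(phrase_pairs)
    tgt_counts = Counter(t for _, t in phrase_pairs)
    return {(s, t): c / tgt_counts[t] for (s, t), c in pair_counts.items()}

# Toy list of extracted (source phrase, target phrase) pairs.
extracted = [("s1", "t1"), ("s1", "t1"), ("s2", "t1"), ("s1", "t2")]
table = estimate_phrase_probs(extracted)
for (s, t), p in sorted(table.items()):
    print(s, t, round(p, 2))
```

For each target phrase the estimates over all source phrases sum to 1, as expected for a conditional distribution.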
[Figure: the monolingual corpus is used to build the language model (LM); the decoder uses the phrase table (Si, Ta, P(s|t)) and the LM to translate a Sinhala sentence into Tamil]
Building the Language Model
 Used to ensure fluent output.
 Estimates the probability of each word given the preceding words (n-grams); a trigram language model is the usual choice.
 Tools: KenLM, SRILM* or IRSTLM
E.g. two Tamil trigrams that share their first two words:
[Tamil w1] [Tamil w2] [Tamil w3]
[Tamil w1] [Tamil w2] [Tamil w3']

P(w3 | w1 w2) = Count(w1 w2 w3) / Count(w1 w2)

w3             w1 w2                   score
[Tamil w3]     [Tamil w1] [Tamil w2]   -1.855783
[Tamil w3']    [Tamil w1] [Tamil w2]   -0.4191293
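The count ratio for a trigram probability can be sketched as follows. This is the plain maximum-likelihood estimate on a toy corpus of placeholder words; toolkits such as KenLM additionally apply smoothing and back-off and store log10 probabilities, none of which is reproduced here. The function name is hypothetical.

```python
from collections import Counter

def trigram_mle(sentences):
    """Return prob(w1, w2, w3) = Count(w1 w2 w3) / Count(w1 w2)."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        words = sent.split()
        for i in range(len(words) - 2):
            # Count each bigram only where a third word follows it,
            # so the estimates for a given context sum to 1.
            bi[tuple(words[i:i + 2])] += 1
            tri[tuple(words[i:i + 3])] += 1

    def prob(w1, w2, w3):
        if bi[(w1, w2)] == 0:
            return 0.0  # unseen context; a real LM would back off instead
        return tri[(w1, w2, w3)] / bi[(w1, w2)]

    return prob

prob = trigram_mle(["a b c", "a b d", "a b c"])
print(prob("a", "b", "c"), prob("a", "b", "d"))
```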
[Figure: the decoder translates a Sinhala sentence into Tamil using two tables:
 the language model table (w3, w1 w2, score), with example scores -1.855783 and -0.4191293
 the phrase table (Si, Ta, P(s|t)), with example probabilities 0.66 and 0.12]
Sinhala   Tamil    P(s|t)
s1        t1       0.66
s1        t1b      0.22
s2        t2       0.34
s1 s2     t1 t2    0.23
s3        t3       0.25
s4        t4       0.12
s3 s4     t3       0.62
(s1 to s4 stand for the words of the Sinhala input sentence; t1, t1b, t2, t3 and t4 stand for Tamil words.)
[Figure: search graph of partial Tamil hypotheses that the decoder builds from these phrase pairs]
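A much-simplified sketch of phrase-based decoding: hypotheses are kept in stacks indexed by how many source words they cover, each hypothesis is extended with a phrase-table entry, and each stack is pruned to a small beam. Unlike Moses, this sketch is monotone (no reordering), and the language model is a stand-in that charges a fixed cost per output word. The function names, the placeholder tokens and the uniform LM are assumptions for illustration.

```python
import math

def decode_monotone(src, phrase_table, lm_logprob, beam=5, max_len=3):
    """Left-to-right beam search; stacks[i] holds hypotheses covering src[:i]."""
    stacks = [[] for _ in range(len(src) + 1)]
    stacks[0].append((0.0, ()))  # (log score, target words so far)
    for i in range(len(src)):
        # Prune: expand only the best `beam` hypotheses of this stack.
        stacks[i] = sorted(stacks[i], reverse=True)[:beam]
        for score, out in stacks[i]:
            for j in range(i + 1, min(i + max_len, len(src)) + 1):
                for tgt, p in phrase_table.get(tuple(src[i:j]), []):
                    new_words = tgt.split()
                    new_score = score + math.log(p) + lm_logprob(out, new_words)
                    stacks[j].append((new_score, out + tuple(new_words)))
    if not stacks[-1]:
        return None  # no full-coverage hypothesis found
    return " ".join(max(stacks[-1])[1])

# Toy phrase table with placeholder tokens: source phrase -> [(target, prob)].
phrase_table = {
    ("s1",): [("t1", 0.66), ("t1b", 0.22)],
    ("s2",): [("t2", 0.34)],
    ("s1", "s2"): [("t1 t2", 0.23)],
    ("s3",): [("t3", 0.25)],
    ("s4",): [("t4", 0.12)],
    ("s3", "s4"): [("t3", 0.62)],
}

def uniform_lm(history, new_words):
    """Stand-in LM: every word costs log(0.1), regardless of history."""
    return len(new_words) * math.log(0.1)

result = decode_monotone(["s1", "s2", "s3", "s4"], phrase_table, uniform_lm)
print(result)  # t1 t2 t3
```

With these toy numbers the two-word phrase pairs win over word-by-word translation, so four source words map to three target words.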
Using Moses for Si-Ta Translation
 Custom tokenization
 Morphologically rich languages
 Low-resource languages
 Standards are not well established
Thanks & Questions