Learning to Translate with
Joey NMT
PyData Meetup Montreal
Julia Kreutzer
Feb 25, 2021
Today
1. Neural Machine Translation 101
a. Translation as a ML problem
b. Transformer model
c. The role of data
2. Joey NMT
a. Features and purpose
b. Demo
c. Use cases
3. Q & A
Assumes basic ML knowledge and familiarity with neural networks.
What's the technology behind modern
machine translation?
How can you get started?
Why open-sourcing?
Why another toolkit?
[Optional] Demo Preparation
If you want to train your own translation model during this presentation:
1. Open joey_demo.ipynb on Colab.
2. Create a copy.
3. Select GPU runtime: Runtime -> Change runtime type -> Hardware accelerator: GPU
4. Run all cells: Runtime -> Run all
5. Come back to the talk :)
We'll inspect what's happening there later in the talk.
Neural Machine Translation 101
Translation as a ML Problem
Challenges
● Unlimited length
● Structural dependencies
● Unseen words
● Figurative language
Seq2Seq
● Modeling sentences (mostly)
● Connections between all words
● Sub-word modeling
● A lot of training data
Input: What is a poutine ?
Output: Qu'est-ce qu'une poutine ?
The Transformer
"Attention is all you need" (Vaswani et al., 2017)
[Figure: the Transformer encoder-decoder architecture, with decoder specialties highlighted. Source: Vaswani et al. 2017]
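Where the original slide shows the architecture diagram, a rough PyTorch skeleton may help fix the shape of the model. This is a minimal sketch, not Joey NMT's implementation; positional encodings are omitted, and all names and dimensions are illustrative.

    import torch.nn as nn

    class TinyTransformerNMT(nn.Module):
        """Minimal encoder-decoder Transformer, for illustration only."""
        def __init__(self, src_vocab_size, trg_vocab_size, d_model=512):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab_size, d_model)
            self.trg_emb = nn.Embedding(trg_vocab_size, d_model)
            # nn.Transformer bundles the encoder and decoder stacks.
            self.transformer = nn.Transformer(d_model=d_model, batch_first=True)
            self.out = nn.Linear(d_model, trg_vocab_size)

        def forward(self, src, trg):
            # Causal mask: each target position attends only to earlier ones.
            mask = self.transformer.generate_square_subsequent_mask(trg.size(1))
            h = self.transformer(self.src_emb(src), self.trg_emb(trg), tgt_mask=mask)
            return self.out(h)  # (batch, trg_len, trg_vocab_size)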
Training vs. Inference
Conditional language modeling: predict the next token y_t,
● given the source X and all previous tokens of the reference during training;
● given the source X and previously predicted tokens during inference.
Training with MLE, inference with greedy or beam search.
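To make the two regimes concrete, here is a minimal sketch (not Joey NMT's code), assuming a model like the skeleton above where model(src, trg_prefix) returns next-token logits:

    import torch
    import torch.nn as nn

    loss_fn = nn.CrossEntropyLoss()

    def train_step(model, src, ref):
        # Training (teacher forcing): condition on the gold prefix ref[:, :-1]
        # and score the shifted reference ref[:, 1:] -> MLE / cross-entropy.
        logits = model(src, ref[:, :-1])                    # (batch, len-1, vocab)
        return loss_fn(logits.transpose(1, 2), ref[:, 1:])

    def greedy_decode(model, src, bos_id, max_len=100):
        # Inference: condition on the model's own previous predictions.
        ys = torch.full((src.size(0), 1), bos_id, dtype=torch.long)
        for _ in range(max_len):
            next_tok = model(src, ys)[:, -1].argmax(-1, keepdim=True)
            ys = torch.cat([ys, next_tok], dim=1)
        return ys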
Beam Search
Keep the k most likely prediction sequences at each step (here: k=2).
● more expensive than greedy
● more exact
Implementation on mini-batches is tricky!
[Figure source: G. Neubig's course on MT and Seq2Seq]
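As a toy illustration of the pruning idea (a sketch over an assumed scoring interface, not Joey NMT's batched implementation):

    def beam_search(next_log_probs, bos, eos, k=2, max_len=20):
        """Keep the k most likely prefixes at each step.
        next_log_probs(prefix) -> {token: log_prob} is an assumed interface."""
        beams = [([bos], 0.0)]   # (prefix, cumulative log-probability)
        finished = []
        for _ in range(max_len):
            candidates = [(prefix + [tok], score + lp)
                          for prefix, score in beams
                          for tok, lp in next_log_probs(prefix).items()]
            # Prune: keep only the k highest-scoring expansions.
            candidates.sort(key=lambda c: c[1], reverse=True)
            beams = []
            for prefix, score in candidates[:k]:
                (finished if prefix[-1] == eos else beams).append((prefix, score))
            if not beams:   # all k beams have produced EOS
                break
        return max(finished + beams, key=lambda c: c[1])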
Words?
Pre-processing plays a huge role in NMT.
qu'est-ce qu'une poutine ? 4 tokens, 4 types
vs
qu ' est - ce qu ' une poutine ? 10 tokens, 8 types
● Sub-words instead of words: frequency-based automatic segmentation.
● Algorithms: BPE, unigram LM.
● Implementations: subword-nmt, SentencePiece.
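For example, a SentencePiece model can be trained and applied in a few lines (a sketch; file names, vocabulary size, and the sample output are placeholders):

    import sentencepiece as spm

    # Learn a subword vocabulary from raw text (one sentence per line).
    spm.SentencePieceTrainer.train(
        input="train.fr", model_prefix="fr_subwords",
        vocab_size=4000, model_type="bpe",   # or "unigram" for the unigram LM
    )

    sp = spm.SentencePieceProcessor(model_file="fr_subwords.model")
    print(sp.encode("qu'est-ce qu'une poutine ?", out_type=str))
    # e.g. ["▁qu", "'", "est", "-ce", "▁qu", "'", "une", "▁poutine", "▁?"]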
The Role of Data
A "base"-sized Transformer has ~65M weights. How much data does it need?
● It depends!
● The "as much as you can find" heuristic
● Beyond parallel data:
  ○ unsupervised NMT
  ○ data augmentation
  ○ dictionaries
  ○ pre-trained embeddings
  ○ multilingual modeling
How similar are the source and target languages?
What kind of quality are you expecting?
How complex is the text?
Evaluation
Input: What is a poutine ?
Reference: Qu'est-ce qu'une poutine ?
Outputs:
1. Est-ce qu'une poutine ?
2. Que-ce une poutine ?
3. Qu'une poutine ?
4. Qu'est-ce qu'un poutin ?
5. C'est qu'une poutine .
How should these outputs be ranked / scored?
Evaluation
Input: What is a poutine ?
Reference: Qu'est-ce qu'une poutine ?

Outputs                           BLEU   ChrF
1. Est-ce qu'une poutine ?        59.5   82.8
2. Que-ce une poutine ?           32.0   51.4
3. Qu'une poutine ?               39.4   58.3
4. Qu'est-ce qu'un poutin ?       19.0   74.4
5. C'est qu'une poutine .         32.0   60.8

BLEU: geometric average of token n-gram precisions, with a brevity penalty.
ChrF: character n-gram F-score.
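Concretely, BLEU = BP * exp(sum_{n=1..4} 1/4 * log p_n), where p_n are the token n-gram precisions and BP = min(1, e^(1 - r/c)) penalizes outputs (length c) shorter than the reference (length r). Both metrics are easy to compute with sacrebleu; a minimal sketch (exact scores depend on tokenization and smoothing settings):

    import sacrebleu

    ref = "Qu'est-ce qu'une poutine ?"
    hyp = "Est-ce qu'une poutine ?"
    print(sacrebleu.sentence_bleu(hyp, [ref]).score)  # BLEU
    print(sacrebleu.sentence_chrf(hyp, [ref]).score)  # ChrF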
Joey NMT
Joint work with Jasmijn Bastings, Mayumi Ohta and Joey NMT contributors
Problem
+ A lot of code for NMT is online. But how long would I have to study it? Are all features documented?
+ Free compute is available through Colab. But how can I run the code on Colab?
+ Data is freely available. But is it clean? How do I need to prepare it for the toolkit?
Does all of that really mean NMT is accessible?
Solution
Joey NMT: clean, minimalist, documented.
● Much smaller than other toolkits
● Covers the core features
● Usability validated in a user study
● The core API changes very little
● Examples, pre-trained models, tutorials, FAQ
● Based on PyTorch
By design, it does not do everything and does not grow much.
Features
You can:
● train an RNN or Transformer model
● on CPU, or on one or multiple GPUs
● monitor the training process
● configure hyperparameters
● store it, load it, test it
And more:
● follow training recipes
● modify the code easily
● get inspiration from other extensions
● share/load pre-trained models
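In practice, all of this is driven by a single YAML configuration file and a handful of CLI modes, roughly as in the Joey NMT documentation (the config path is a placeholder):

    python3 -m joeynmt train configs/my_experiment.yaml      # train and validate
    python3 -m joeynmt test configs/my_experiment.yaml       # evaluate checkpoints
    python3 -m joeynmt translate configs/my_experiment.yaml  # translate new input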
It's cute, but can it compete?
Quality? → Comparable to other toolkits.
Adoption? → Not as popular.
Innovation? → More and more research.
It might not be the best choice for:
● exact replication of another paper → use their code instead
● non-seq2seq applications
● performance-critical applications (not optimized for that)
● loading BERT (not implemented)
Demo
Cool stuff feat. Joey NMT
Grassroots research communities
● Masakhane: NLP for African languages
● Turkic Interlingua: NLP for Turkic languages
Extensions
● Reinforcement learning
● Sign language translation
● Speech translation
● Image captioning
● Slack bot
More on this list.
Material
● Neural networks in NLP
  ○ Y. Goldberg: A Primer on Neural Network Models for Natural Language Processing
  ○ G. Neubig: CMU CS 11-747: Neural Networks for NLP
● Neural machine translation
  ○ P. Koehn: Neural Machine Translation (draft chapter of the Statistical MT book)
  ○ G. Neubig: Tutorial on Neural Machine Translation
  ○ A. Rush: The Annotated Transformer
  ○ J. Bastings: The Annotated Encoder-Decoder
  ○ M. Müller: Seven Recommendations for MT Evaluation
● Joey NMT
  ○ Joey NMT paper
  ○ Joey NMT tutorial
  ○ Masakhane notebooks and YouTube tutorial
  ○ Turkic Interlingua YouTube tutorial
Thank you!
jkreutzer@google.com
Twitter: @KreutzerJulia
Q & A
