狠狠撸

2020.08.14
田口直弥
ACL 2020 参加報告
~ BPE 系 tokenization の背景と最新動向 ~

2
? text tokenization について
? NLP を ML で行う際の text 分割手法
? 特に BPE に関連する ACL2020 の論文２本を紹介
今回話す内容
I stayed at home this weekend.
I s t a y e d a t h o m e t h i s w e e k e n d .
I stayed at home this weekend .
I stay ed at home this week end .

3
? 少なくとも NLP w/ NN をするなら tokenization するはず
? 最近 kaggle でも BPE tokenizer を使った
? アプローチが汎用的だと思った
? NLP に限らずここに着想を得て結構応用が効きそう
なんでこのテーマ？

4
? seq2seq フレームワーク　
? token 系列を入力とし、token 系列を出力
? models は RNNs, CNNs, transformers 等...
? 詳細はボリューム大 & 今回は説明なしでも話が進むので省略
Neural Machine Translation (NMT)
models
I stayed at home this weekend .
この週末は家にいました。

5
? 機械翻訳でよく使用される metric
? 0~1 で高いほどよい翻訳結果
? brevity penalty: 予測訳が正解訳より短い場合にペナルティを課す
? n-gram overlap: 1~4-gram の modi?ed precision (被り n-gram 数 / 全 n-gram 数)
BLEU [Papineni et al., 2002] https://cloud.google.com/translate/automl/docs/evaluate?hl=ja
(~0.1)
(0.1~0.19)
(0.2~0.29)
(0.3~0.4)
(0.4~0.5)
(0.5~0.6)
(0.6~)
bleu score の解釈 (※図のスコアは 0~100 で書かれているので補正値を青字で記載)

6
ACL
2016
Neural Machine Translation of
Rare Words with Subword Unit
[Sennrich et al., 2016]

7
? 効率的な tokenization 手法である BPE を提案
? 学習データにない単語にも対応可能 (open-vocaburaly)
? 語彙数を抑えつつ、使用時の性能も良い
? (NLP 一般に使えるが) 着目する問題設定は NMT (機械翻訳)
概要

8
良い tokenization の設計方針
できるだけ tokenization の粒度を細かくしたい
?vocabulary 関連の時空間計算量の削減
?組み合わせによる表現で open-vocabulary 化が期待できる
?etc...
できるだけ tokenization の粒度を荒くしたい
?分割 token 数の削減による処理の効率化
?分割 token 数の削減による NN の情報伝播負担の削減
?etc...
trade o?
頻出語はまとめ、珍しい単語は
subword の組み合わせで表現！

9
? 情報圧縮に使われる技術の話
? 頻出する 2 byte を 1 byte 文字に置き換えることを繰り返す
Byte Pair Encoding (BPE) [Gage, 1994]
BPE の適用例
(wikipedia より引用：https://ja.wikipedia.org/wiki/%E3%83%90%E3%82%A4%E3%83%88%E5%AF%BE%E7%AC%A6%E5%8F%B7%E5%8C%96)

? BPE tokenization の学習 (※論文由来でなく、本発表においての呼称です)
10
提案手法
① vocabulary を a, b, … 等の『文字』で初期化
② 学習用の text データを vocabulary をもとに tokenize
③ ②の token 系列において各 token の『ペア』を数える
④ ③ で最も数の多かった『ペア』について vocabulary?operation に追加
⑤ vocabulary の大きさが規定の数に達したら終了。そうでない場合 ② へ
a, b, c, ...
?T h e _ s c o r e _ b e c o m e s _ l o w e r _ a n d _ l o w e r .
?I _ h a v e _ a c h i e v e d _ b e t t e r _ r e s u l t .
※簡単のため、２ sentences
vocabulary
text データ
operation

11
提案手法
a, b, c, ...
vocabulary
text データ
operation
『er』が最多！

12
提案手法
er, a, b, c, ...
vocabulary
text データ
e-r → er
operation
『er』を追加
こちらも追加

13
提案手法
er, a, b, c, ...
?T h e _ s c o r e _ b e c o m e s _ l o w er _ a n d _ l o w er .
?I _ h a v e _ a c h i e v e d _ b e t t er _ r e s u l t .
vocabulary
text データ
e-r → er
operation
以降 er をペアとして tokenize

? BPE tokenization の適用
? 文字分割した後 operation を順に適用
14
提案手法
e-r → er
t-h → th
th-e → the
operation
T h e _ s c o r e _ b e c o m e s _ l o w e r _ a n d _ l o w e r .
T h e _ s c o r e _ b e c o m e s _ l o w er _ a n d _ l o w er .
...
Th e _ s c o r e _ b e c o m e s _ l o w er _ a n d _ l o w er .
The _ s c o r e _ b e c o m e s _ l o w er _ a n d _ l o w er ....
↓
↓
↓

15
? BPE は効率よく tokenize できている
? # tokens: 全 token 数
? # types : vocabulary 数
? # UNK : test データにおける未知語数
? BPE (joint) : source & target の言語を同時に tokenize
BPE tokenization 性能
tokenize に関する各種統計量

16
? 様々な metric における他手法との比較
? shortlist (char-bigram 外で別途用意された vocabulary) 無しでも良い性能
? rare に対する性能が良い
BPE tokenization 性能

17
ACL
2018
Subword Regularization: Improving
Neural Network Translation Models
with Multiple Subword Candidates
[Kudo, 2018]

18
? BPE tokenization が決定的な tokenize を行うことを問題視
? hello は本来 hello, he-llo, h-e-l-l-o 等様々な分割が確率的に行われ得る
? 確率的に tokenize を行い正則化効果を期待する手法を提案
? unigram language model ベースの手法
? 効率的に計算するため、viterbi algorithm と EM algorithm を併用
概要

19
? tokenize は本来決定的ではなく確率的に表現される
? BPE tokenization は決定的
? 確率はサンプリングにより表現
アイデア
BPE 提案
サンプリングで表現
データサイズ (文章数)
モデルのパラメータ
tokenize された s 番目の翻訳対象
tokenize された s 番目の翻訳結果
s 番目の全 tokenize 候補について期待値計算

20
? 各 token が独立に生起するという仮定を置く
? token 候補は vocabulary (これが与えられている前提) の要素
? token 系列の生起確率がそれぞれの token のものの積で表現可能
? 後は token 系列の生起確率に応じて学習時にサンプリングすれば提案が実現可能
? 最も確率の高い token 系列は Viterbi algorithm で効率よく計算可能 (後ほど使用)
Unigram Language Model
I
at
a t
s
stay
stayed
ed
t a y e d
...
...
...
文の token 数
token 系列の生起確率

21
? 各 token が独立に生起するという仮定を置く
? token 候補は vocabulary (これが与えられている前提) の要素
? token 系列の生起確率がそれぞれの token のものの積で表現可能
? 後は token 系列の生起確率に応じて学習時にサンプリングすれば提案が実現可能
? 最も確率の高い token 系列は Viterbi algorithm で効率よく計算可能 (後ほど使用)
Unigram Language Model
I
at
a t
s
stay
stayed
ed
t a y e d
...
...
...
文の token 数
token 系列の生起確率
① 各 token の生起確率はどうやって求める？
② vocabulary はどうやって確定させる？

22
? L を最大化する EM algorithm (appendix) により各 token の生起確率を算出
? 隠れ変数 : 各 token の生起確率
? E-step : M-step の token 系列における出現頻度に基づき、隠れ変数を推定
? M-step : 各文について、viterbi algorithm で最も確からしい token 系列を求める
各 token の生起確率を求める
viterbi algorithm で計算可能

23
? L を最大化する EM algorithm (appendix) により各 token の生起確率を算出
? 隠れ変数 : 各 token の生起確率
? E-step : M-step の token 系列における出現頻度に基づき、隠れ変数を推定
? M-step : 各文について、viterbi algorithm で最も確からしい token 系列を求める
各 token の生起確率を求める
viterbi algorithm で計算可能
この操作自体も vocabulary がわからないと適用できない

24
? 反復試行により vocabulary を更新していく
? vocabulary の確定と token 生起確率の算出を同時に行うのは難しい
vocabulary を求める
① heuristic に大きめの vocabulary の初期値を設定
② EM algorithm により token の生起確率と現状最良の tokenization を算出
③ 各 token の loss を計算
→ loss は各 token を vocabulary から抜いたときの L の減少度合いから算出
④ loss により token を sort し、上位 η% を vocabulary から除外
⑤ vocabulary の大きさが指定したものより大きい場合、② に戻る

25
? 様々なコーパス上で baseline (BPE) といくつかの提案手法を比較
? l は各文についてのサンプリング数 (l=1 は unigram language model での best な tokenization を使用)
? one-best vs n-best は翻訳候補の算出数 (n-best では n 個の内 best なものを使用)
? 基本的に提案手法は有効、特にデータが少ない場合はより有効 (augmentation 効果)
提案手法の性能

26
ACL
2020
BPE-Dropout: Simple and E?ective
Subword Regularization
[Provilkov et al., 2020]

27
? tokenize における正則化をよりシンプルな BPE-dropout で実現
? 簡単に適用できる上、性能も良い
? kaggle とかでも使えそう？
概要

28
? operation を確率的に適用
? モデルの学習時のみ dropout し、推論時には通常の BPE と同様の tokenization
提案手法 (BPE-dropout)
e-r → er
t-h → th
th-e → the
operation
...
T h e _ s c o r e _ b e c o m e s _ ...
T h e _ s c o r e _ b e c o m e s _ ...
Th e _ s c o r e _ b e c o m e s _ ...
Th e _ s c o r e _ b e c o m e s _ ...
...
↓
↓
↓
例えばこれが適用され
なかった場合
The という単語は token にならない

29
? シンプルな手法にも関わらず良い性能

30
ACL
2020
Dynamic Programming Encoding for
Subword Segmentation in
Neural Machine Translation
[He et al., 2020]

31
? 『機械翻訳における翻訳先のみ』に tokenization 手法を提案
? vocabulary は BPE により与えられている前提
? Transformer を使った context 解釈による組合せ爆発の回避
? 動的計画法を使って効率的に計算
概要

32
? 機械翻訳における『翻訳先側の学習データ』の tokenization 手法
? Dynamic Programming Encoding (DPE)
アイデア

33
? tokenization の仕方についての周辺化を行いたい
? tokenization を潜在変数によるものとみなし、その周辺化により得られる
文章自体の生起確率の最大化を目指す
? 一方、組み合わせ爆発で現実的ではない...
→ 『context』を文字ベースにして組み合わせ削減！
アイデア
(翻訳結果の) 文章 (文字の系列) 自体の生起確率
『context』= i 番目までの tokenization された翻訳結果に依存
(書かれていないが、翻訳元 x にも依存)
tokenize の仕方 (非常に多い)
i 番目の token

34
? Transformer を使い、k 文字目におけるtoken 生起分布を算出
? 入力が『文字』?出力が『token 生起分布』
? ただし、(algirithm に表記はないが) 翻訳元 x は tokenize された形で入力
? 動的計画法により効率よく (mixed character-subword transformer を) 学習
提案手法
周辺化して得られた y (翻訳元の文) の生起確率 1 になるように学習
transformer で求める
周辺化された『context』の生起確率。
このやり方重複計算を防ぐ (動的計画法)
vocabulary にある token に限定
k 文字目『まで』

35
? 推論 (tokenization) 時には最も確率の高い token 系列を利用
? これにより『機械翻訳の翻訳先』の tokenization ができ、この token 系列を
使って機械翻訳を学習。『翻訳元』には BPE dropout を使用
? 学習における log-sum-exp が max になった形
提案手法

36
? 翻訳性能を BLEU で評価
? ほぼ全てのケースでBPE dropout よりも高性能
? 翻訳元の tokenization に依存して翻訳先の tokenize をできるのが性能の一因

37
? BPE やそこから派生する４つの論文の概要を紹介
? 確率の表現やその高速化における様々な工夫
? 詳細な実験結果等は論文を参照
? 以上！
まとめ

38
Kishore Papineni, Salim Roukos, Todd Ward, and Wei- Jing Zhu. 2002.
Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL.
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016.
Neural machine translation of rare words with subword units. In Proc. of ACL.
Philip Gage. 1994.
A new algorithm for data compression. C Users J. 12(2):23–38.
Taku Kudo. 2018.
Subword regularization: Improving neural network translation models with multiple subword candidates.
In Proc. of ACL.
Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita. 2020.
Bpe-dropout: Simple and e?ective subword regularization. In Proc. of ACL.
Xuanli He, Gholamreza Ha?ari, Mohammad Norouzi. 2020.
Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation. In Proc. of ACL.
参考文献

40
? 隠れ変数がある場合に尤度の最大化をしたい際に使用する
? 尤度が隠れ変数に依存するが、隠れ変数は観測できない
→ 以下の施行で尤度の期待値の最大化を繰り返す
① データと現状の予測パラメータから隠れ変数の事後確率を計算し、
これを用いて (対数) 尤度の期待値を計算 (E step)
② (対数) 尤度の期待値を最大化すパラメータを推定 (M step)
③ 収束しない場合 ① に戻る
? ex) 混合正規分布のパラメータ予測
? 隠れ変数 : 各サンプルがどの正規分布に属するか (0 or 1)
EM algorithm

狠狠撸

Acl2020 taguchi

More Related Content

Acl2020 taguchi