The document summarizes a research paper that compares the performance of MLP-based models to Transformer-based models on various natural language processing and computer vision tasks. The key points are:
1. Gated MLP (gMLP) architectures can achieve performance comparable to Transformers on most tasks, demonstrating that attention mechanisms may not be strictly necessary.
2. However, attention still provides benefits for some NLP tasks, as models combining gMLP and attention outperformed pure gMLP models on certain benchmarks.
3. For computer vision, gMLP achieved results close to Vision Transformers and CNNs on image classification, indicating gMLP can match their data efficiency.
The document discusses hyperparameter optimization in machine learning models. It introduces various hyperparameters that can affect model performance, and notes that as models become more complex, the number of hyperparameters increases, making manual tuning difficult. It formulates hyperparameter optimization as a black-box optimization problem to minimize validation loss and discusses challenges like high function evaluation costs and lack of gradient information.
This material mainly explains BERT; compared with the BERT coverage, XLNet and RoBERTa are not followed in as much detail.
Also, please note that my own figures read from top to bottom, while the figures taken from the papers read from bottom to top.
If you find any mistakes, please let me know and I will fix them.
(In particular, I am a little worried that I may have misread the English of the RoBERTa paper. Apologies in advance.)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
XLNet: Generalized Autoregressive Pretraining for Language Understanding
RoBERTa: A Robustly Optimized BERT Pretraining Approach
These slides were used by Umemoto, a member of our company, at an internal technical study session.
They explain the Transformer, an architecture that has been attracting attention in recent years.
"Arithmer Seminar" is held weekly; professionals from inside and outside our company give lectures on their respective areas of expertise.
The slides were made by a lecturer from outside our company and are shared here with his/her permission.
Arithmer Inc. is a mathematics company that began at the Graduate School of Mathematical Sciences of the University of Tokyo. We apply modern mathematics and AI systems to provide solutions to tough, complex problems across many fields. At Arithmer we believe it is our job to use AI well to improve work efficiency and to produce results that are useful to people and society.
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces (Deep Learning JP)
This document summarizes a research paper on modeling long-range dependencies in sequence data using structured state space models and deep learning. The proposed S4 model (1) derives recurrent and convolutional representations of state space models, (2) improves long-term memory using HiPPO matrices, and (3) efficiently computes state space model convolution kernels. Experiments show S4 outperforms existing methods on various long-range dependency tasks, achieves fast and memory-efficient computation comparable to efficient Transformers, and performs competitively as a general sequence model.
This document discusses generative adversarial networks (GANs) and their relationship to reinforcement learning. It begins with an introduction to GANs, explaining how they can generate images without explicitly defining a probability distribution by using an adversarial training process. The second half discusses how GANs are related to actor-critic models and inverse reinforcement learning in reinforcement learning. It explains how GANs can be viewed as training a generator to fool a discriminator, similar to how policies are trained in reinforcement learning.
Paper introduction: Big Bird: Transformers for Longer Sequences (Toru Tamaki)
Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed, Big Bird: Transformers for Longer Sequences, Advances in Neural Information Processing Systems 33 (NeurIPS 2020)
https://proceedings.neurips.cc/paper/2020/hash/c8512d142a2d849725f31a9a7a361ab9-Abstract.html
Paper introduction (ACL 2017): Get To The Point: Summarization with Pointer-Generator Networks (Masayoshi Kondo)
A research paper on the neural text summarization task, accepted as an ACL 2017 long paper; joint work between a PhD student in Christopher Manning's lab at Stanford and Google Brain. The model adds a mechanism that avoids repetition during generation for long (multi-sentence) inputs, making it possible to generate summaries of long documents. Slides for a paper introduction at a seminar. Paper URL: https://arxiv.org/abs/1704.04368
2. - This talk summarizes the following three papers:
"Neural Machine Translation by Jointly Learning to Align and Translate"
"Attention Is All You Need"
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- We look back at the origins and development of "Attention", which has become central to deep-learning-based NLP in recent years.
Introduction
3. Outline
"Neural Machine Translation by Jointly Learning to Align and Translate"
- Translation models based on LSTMs
- Attention + RNN
"Attention Is All You Need"
- Replacing the RNN with attention
- Self-attention and the Transformer
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
- Pre-training: Masked LM and Next Sentence Prediction
- BERT's performance
10. - Do we even need an RNN in the first place?
- Computation is slow because the sequence has to be fed in step by step
- Training breaks down on long sentences (because of vanishing or exploding gradients)
- So let's replace the RNN with attention (a minimal sketch of the attention computation follows this slide)
- This led to the Transformer, proposed in "Attention Is All You Need"
From RNNs to Attention
https://adventuresinmachinelearning.com/recurrent-neural-networks-lstm-tutorial-tensorflow/
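To make the contrast with a sequential RNN concrete, here is a minimal sketch of scaled dot-product attention in PyTorch (my own illustration, not code from the slides): the whole sequence is handled by a few matrix multiplications, with no step-by-step recurrence. The function name, shapes, and toy sizes are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (batch, seq_len, d_model). Returns one output vector per position."""
    d_k = q.size(-1)
    # Every query is compared with every key in one matrix multiplication,
    # so all positions are processed at once rather than step by step.
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)          # attention weights over the sequence
    return torch.matmul(weights, v)              # weighted sum of the values

# Toy self-attention: q = k = v = the same sequence of 5 token vectors.
x = torch.randn(2, 5, 16)                        # (batch, seq_len, d_model)
print(scaled_dot_product_attention(x, x, x).shape)   # torch.Size([2, 5, 16])
```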
11. - Authors / affiliations
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
- Google Brain, Google Research, University of Toronto
- In one sentence
- Proposes the "Transformer", an encoder-decoder model in which the RNN is replaced by attention (see the encoder-decoder sketch after this slide). Achieves SoTA on many tasks despite a short training time.
Paper: "Attention Is All You Need"
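As one concrete way to see "an encoder-decoder model whose RNN is replaced by attention", the sketch below uses PyTorch's built-in nn.Transformer with small illustrative sizes; the shapes and hyperparameters here are assumptions for the example, not the configuration from the paper.

```python
import torch
import torch.nn as nn

# Attention-based encoder-decoder with no recurrence (toy sizes; the paper's
# base model uses d_model=512, 8 heads, and 6 layers on each side).
model = nn.Transformer(d_model=64, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2,
                       batch_first=True)

src = torch.randn(2, 10, 64)   # (batch, source length, d_model), already embedded
tgt = torch.randn(2, 7, 64)    # (batch, target length, d_model)
out = model(src, tgt)          # (2, 7, 64): one output vector per target position
print(out.shape)
```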
25. Summary of "Attention Is All You Need"
- We looked at attention with memory
- Introduced self-attention, where the input attends to itself, completing the Transformer, an architecture that removes the RNN entirely
- Computation can be parallelized
- Handles variable-length inputs well (one common way to do this, with a padding mask, is sketched after this slide)
Eliminating the RNN
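One standard way to handle variable-length inputs is to pad each batch to a common length and mask the padded key positions so the softmax ignores them. The sketch below is my own illustration of that idea (function name, shapes, and sizes are assumptions), not code from the paper.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(x, lengths):
    """x: (batch, max_len, d_model); lengths: the true length of each sequence."""
    d = x.size(-1)
    scores = torch.matmul(x, x.transpose(-2, -1)) / d ** 0.5        # (batch, L, L)
    # Padding mask: True wherever a key position lies beyond the real length.
    positions = torch.arange(x.size(1))
    pad = positions[None, None, :] >= lengths[:, None, None]        # (batch, 1, L)
    scores = scores.masked_fill(pad, float("-inf"))                 # padded keys get zero weight
    return torch.matmul(F.softmax(scores, dim=-1), x)

# A batch of two sequences padded to length 6, with real lengths 6 and 3.
x = torch.randn(2, 6, 8)
print(masked_self_attention(x, torch.tensor([6, 3])).shape)         # torch.Size([2, 6, 8])
```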
26. - Authors / affiliations
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
- Google AI Language
- In one sentence
- A model that embeds a sentence into context-aware word representations using many stacked Transformer layers (a toy sketch of such a stack follows this slide)
Paper: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
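To illustrate "stacked Transformer layers that turn a sentence into context-aware word representations", the sketch below stacks PyTorch encoder layers on top of toy token embeddings. The vocabulary size, dimensions, and layer count are illustrative assumptions (BERT-base uses 12 layers with hidden size 768), and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64                         # toy sizes, not BERT's configuration
embed = nn.Embedding(vocab_size, d_model)              # (positional embeddings omitted here)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=4)   # BERT-base stacks 12 such layers

token_ids = torch.randint(0, vocab_size, (2, 10))      # (batch, seq_len) of token ids
contextual = encoder(embed(token_ids))                 # (2, 10, 64): one vector per token,
print(contextual.shape)                                # conditioned on the whole sentence
```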
31. "For each task, we simply plug in the task-specific inputs and outputs into BERT and finetune all the parameters end-to-end."
Fine-tuning means re-training a pre-trained model for a specific task.
Example:
For classification, a [CLS] token is placed at the start of the input, and a small network on top of the BERT output at that position makes the prediction (a minimal sketch of such a head follows this slide).
Fine-tuning BERT
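A minimal sketch of that classification setup, under the assumption that `encoder` is any module mapping token ids to per-token hidden states (for example the toy stack above); the class `ClsHead`, the stand-in encoder, and the sizes are my own illustrative names, not BERT's actual interface.

```python
import torch
import torch.nn as nn

class ClsHead(nn.Module):
    """Predicts a class from the hidden state at the [CLS] position (index 0)."""
    def __init__(self, encoder, d_model, num_classes):
        super().__init__()
        self.encoder = encoder                    # any module: ids -> (batch, seq, d_model)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)          # (batch, seq_len, d_model)
        cls_vec = hidden[:, 0]                    # output at the [CLS] position
        return self.fc(cls_vec)                   # (batch, num_classes) logits

# Stand-in for a pre-trained encoder; during fine-tuning, the task loss
# (e.g. cross-entropy on these logits) updates all parameters end-to-end.
toy_encoder = nn.Embedding(1000, 64)
model = ClsHead(toy_encoder, d_model=64, num_classes=2)
print(model(torch.randint(0, 1000, (2, 10))).shape)    # torch.Size([2, 2])
```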
34. References (other than the papers)
論文解説 Attention Is All You Need (Transformer)
- http://deeplearning.hatenablog.com/entry/transformer
作って理解する Transformer / Attention
- https://qiita.com/halhorn/items/c91497522be27bde17ce
The Illustrated Transformer
- https://jalammar.github.io/illustrated-transformer/
Neural Machine Translation with Attention
- https://www.tensorflow.org/beta/tutorials/text/nmt_with_attention
Transformer model for language understanding
- https://www.tensorflow.org/beta/tutorials/text/transformer
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
- http://jalammar.github.io/illustrated-bert/
ゼロから作るDeep Learning② - 自然言語処理編 (Deep Learning from Scratch 2: Natural Language Processing)
- Koki Saitoh, 2018/07/21, O'Reilly Japan
#4: The concept of attention first appeared in Seq2Seq models.
That early attention was a little different from attention as we use it now; we look at what it was like at the time.
Attention Is All You Need (2017) came a little under two years after attention was invented for Seq2Seq.
We trace how attention evolved over that period.
Finally, we talk about BERT, a general-purpose pre-trained model for NLP.