Ask the right question
Active Question Reformulation with Reinforcement Learning

2018.06.11 伎
Table of Contents
1. Reinforcement Learning

2. Active Question Answering

3. BiDirectional Attention Flow

4. Experiment

5. Analysis of the Agent's Language
Reinforcement Learning
Reinforcement Learning
 Reinforcement Learning = Reinforcement + Machine Learning

 What is Reinforcement?

 Learning through direct trial, without being taught, which behavior earns more reward

 Ex) Skinner box experiment
Reinforcement Learning
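 What is Reinforcement Learning?

 Data X: which action was taken in which situation

 Y: how much reward was received

 Learn the relationship between the two from data, so as to find the behavior that earns the most reward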
Reinforcement Learning
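 What is Reinforcement Learning?

 Data comes from the interaction between the agent and the environment (state, action, reward history)

 Goal: find the optimal policy, i.e. the policy that maximizes reward

 Agent: the goal-directed entity that observes the state and chooses actions

 Environment: everything other than the agent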
Markov Decision Process
 MDP (Markov Decision Process)

 A framework for sequential decision-making problems

 5-tuple (state, action, reward, transition probability, discount factor)
Markov Decision Process
 Grid-world MDP
Markov Decision Process
 Grid-world MDP

 The goal of learning is to find the optimal policy

 Policy = the probability of taking a given action in a given state
Markov Decision Process
 Return: the cumulative (discounted) reward received after taking a given action in a given state
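As a small illustration of the return, the sketch below sums a reward sequence with a discount factor. The function name `discounted_return` and the default `gamma` are assumptions made for this example, not something taken from the slides.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G = r_0 + gamma * r_1 + gamma**2 * r_2 + ... for one trajectory."""
    g = 0.0
    # Accumulate from the last step backwards so each step is one multiply-add.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward sequence: 1.0 + 0.5 * 0.0 + 0.25 * 2.0
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # prints 1.5
```

With `gamma` close to 1 later rewards count almost as much as immediate ones; with a small `gamma` the agent is effectively short-sighted.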
Policy Gradient
 Assume a parameterized policy (linear function approximation or a neural network)

 Policy input: state features or raw pixels / output: probability of each action
Policy Gradient
 Supervised Learning: maximize the log likelihood

 Update the policy toward the labeled actions (imitation learning; correct action labels exist)
Policy Gradient
 Policy Gradient

 Maximize the log likelihood of the actions taken, weighted by reward (return)
Policy Gradient
 + Shifts the policy distribution toward the actions that receive more reward
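The reward-weighted log-likelihood update can be sketched on a toy multi-armed bandit with a softmax policy. This is a minimal illustration only: the bandit, the function names, the learning rate and the step count are assumptions for the example and are unrelated to the paper's setup.

```python
import math
import random


def softmax(logits):
    """Turn unnormalized scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(rewards, steps=2000, lr=0.1, seed=0):
    """Policy-gradient updates for a softmax policy over len(rewards) arms."""
    rng = random.Random(seed)
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(rewards)), weights=probs)[0]
        reward = rewards[action]
        # For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - probs[i].
        for i in range(len(logits)):
            grad_log_prob = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad_log_prob
    return softmax(logits)


# Hypothetical per-arm rewards; the policy tends to favor the better-paying arms.
print(reinforce_bandit([0.1, 0.5, 1.0]))
```

Because every update is scaled by the reward, actions that pay more pull probability mass toward themselves faster, which is the shift described in the bullet above.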
Application of DeepRL
 Game play

 AlphaGo, Atari, ViZDoom

 Robot control

 robot arm manipulation, locomotion

 Natural language processing

 Question Answering, Chatting

 Autonomous driving

Active Question Answering
Ask The Right Question
 Accepted to ICLR 2018 as an oral presentation
 Jeopardy!: a long-running American quiz show

 Given a clue, contestants must name what the clue refers to

 → a Question Answering problem
https://namu.wiki/w/Jeopardy! https://abcnews.go.com/Entertainment/jeopardy-things-americas-favorite-quiz-show/story?id=18824501
Jeopardy! Dataset
 The dataset is published in the following form: Q&A pairs
SearchQA Dataset

 Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. https://arxiv.org/abs/1704.05179, 2017.

 GitHub repo: https://github.com/nyu-dl/SearchQA

 Built from the Jeopardy! question-answer dataset and augmented by web crawling

 140k question-answer pairs; each pair has 49.6 snippets on average

 Each question was issued as a Google query

 A dataset closer to a real information retrieval system
SearchQA Dataset
Problem Definition
 Active Question Answering

 Frame QA as a Reinforcement Learning problem

 English-to-English machine translation: paraphrasing

 Let's define the MDP!

 Agent: the reformulation model

 Environment: Q&A system

 State: the original question from the dataset

 Action: q (the question reformulated by the agent)

 Reward: question answering quality
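The MDP above amounts to a single-step interaction, which can be sketched as follows. All functions here (`run_episode` and the `toy_*` stand-ins) and the example strings are hypothetical placeholders chosen so the sketch runs; they are not the paper's models.

```python
def run_episode(question, gold_answer, reformulate, qa_system, reward_fn):
    """One AQA step: state -> action (rewrite) -> environment answer -> reward."""
    rewrite = reformulate(question)          # action q chosen by the agent
    answer = qa_system(rewrite)              # environment: black-box Q&A system
    reward = reward_fn(answer, gold_answer)  # answer quality
    return rewrite, answer, reward


# Toy stand-ins (assumptions for illustration only).
def toy_reformulate(question):
    return "what is name " + question


def toy_qa_system(query):
    return "tolstoy" if "war" in query else "unknown"


def toy_reward(answer, gold_answer):
    return 1.0 if answer == gold_answer else 0.0


print(run_episode("gandhi deeply influenced count wrote war", "tolstoy",
                  toy_reformulate, toy_qa_system, toy_reward))
```

The agent never looks inside the Q&A system; it only sees the reward that comes back, which is why a policy-gradient method fits this setup.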
AQA Model
1. QA Environment

 Uses the BiDirectional Attention Flow (BiDAF) model

 Produces an answer for each question (used to compute the reward during training)

 Reward: token-level F1 score (measures answer quality)

2. Reformulation Model

 Sequence-to-sequence model

 Pre-trained on multilingual translation

3. Answer Selection Model

 Model used at test time (trained separately)

 During training the QA environment output is used to compute the reward; at test time the best answer must be chosen from several candidate answers

 Pre-trained embeddings for [query, rewrite, answer]; the three embeddings are concatenated

 Binary classification using 1-D convolution
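The token-level F1 reward used by the QA environment can be sketched as below. This is a simplified version in the style of common QA evaluation scripts: it lowercases and splits on whitespace only, which is an assumption, and the example strings are made up.

```python
from collections import Counter


def token_f1(prediction, ground_truth):
    """Token-level F1 between a predicted answer and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(token_f1("leo tolstoy", "tolstoy"))  # precision 0.5, recall 1.0
```

Unlike exact match, F1 gives partial credit for partially correct answers, which makes it a smoother reward signal for training.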
Reformulation model
 Massive Exploration of Neural Machine Translation Architectures - Denny Britz, 2017
BiDirectional Attention Flow

 GitHub: https://github.com/allenai/bi-att-flow

 A model that answers a query using a context

 Relates query and context through a bidirectional attention flow mechanism

 State of the art on SQuAD (Stanford Question Answering Dataset) at the time of its publication in 2017
BiDirectional Attention Flow
BiDirectional Attention Flow
 Character embedding + word embedding → contextual embedding

 Attention Flow: not fixed length + memoryless
Training: Reformulation
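 Policy Gradient Training

 Ultimately, what we want is to produce the best answer for a given question.

 Parameterized policy

 Because the policy is a seq2seq model, it can be written as a product of per-token probabilities

 The policy generates a question; the environment treats that question as the action and returns a reward

 We then look for the policy that maximizes this reward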
Training: Reformulation
 Policy Gradient Training

 Training algorithm: REINFORCE. The update follows the log-likelihood gradient, weighted by the reward

 The REINFORCE gradient estimate has high variance, so a baseline is used

 Entropy regularization is added to avoid collapsing to a sub-optimal policy (it encourages exploration)
Training: Reformulation
 Policy Gradient Training

 Final objective: expected reward minus a baseline, plus the entropy regularization term. The baseline is the mean reward for q_0
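The formulas for this part did not survive in the text, so the block below is a reconstruction from the bullets above, not a copy of the slide. The notation is assumed for this sketch: q_0 is the original question, q a rewrite with tokens w_1..w_T, f(q) the answer returned by the QA environment, R the reward (token-level F1), B(q_0) the baseline, N the number of sampled rewrites, and \lambda the regularization weight.

```latex
% Policy as a seq2seq model: probability of a rewrite q = (w_1, ..., w_T) given q_0
\pi_\theta(q \mid q_0) = \prod_{t=1}^{T} p_\theta(w_t \mid w_1, \dots, w_{t-1}, q_0)

% REINFORCE gradient estimate with baseline B(q_0), using N sampled rewrites q_i
\nabla_\theta \, \mathbb{E}_{q \sim \pi_\theta(\cdot \mid q_0)} \big[ R(f(q)) \big]
  \approx \frac{1}{N} \sum_{i=1}^{N}
  \nabla_\theta \log \pi_\theta(q_i \mid q_0) \, \big( R(f(q_i)) - B(q_0) \big)

% Entropy-regularized objective
J(\theta) = \mathbb{E}_{q \sim \pi_\theta(\cdot \mid q_0)} \big[ R(f(q)) - B(q_0) \big]
  + \lambda \, H\big[ \pi_\theta(\cdot \mid q_0) \big]
```

Subtracting B(q_0) does not change the expected gradient but reduces its variance, and the entropy term keeps the rewrite distribution from collapsing too early.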


 Pre-training for paraphrasing: translate English-to-English

 To compensate for scarce paraphrase data, multilingual translation is used (English-Spanish, French-English, etc.)

 Multilingual United Nations Parallel Corpus v1.0: 11.4M sentences

 Additional training on monolingual data (small corpus)

 Paralex database of question paraphrases: 1.5M pairs (4 paraphrases per question)
Training: Reformulation
 Pretraining setting

 Optimizer: Adam

 Learning rate: 0.001, train: 400M instances

 RL setting

 Optimizer: SGD

 Train: 100k RL steps

 Batch size: 64

 Learning rate: 0.001

 Regularization weight: 0.001

 The QA system runs on GPU; the reformulation model is trained on CPU
Training: Answer Selector
 Answer Selector: binary classification

 The reformulator generates 20 questions, giving [query, rewrite, answer] triples

 The best answer among them must be found

 To do this, the model classifies whether each rewrite's answer is above or below average

 Above average: positive, below average: negative

 Pre-trained 100-dimensional embedding per token

 Query embedding → 100-d vector → 1-D CNN (filter size = 3)
 Rewrite embedding → 100-d vector → 1-D CNN (filter size = 3)
 Answer embedding → 100-d vector → 1-D CNN (filter size = 3)
 The three outputs feed a feed-forward network
 (However, it is unclear how the single best answer is then picked..?)
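The above/below-average labeling for the selector's training data can be sketched as follows. The function name `label_rewrites`, the example scores, and the choice to treat a score exactly equal to the mean as negative are assumptions made for this illustration.

```python
def label_rewrites(f1_scores):
    """Label each rewrite's answer 1 (positive) if its F1 is above the mean, else 0."""
    mean_f1 = sum(f1_scores) / len(f1_scores)
    # Assumption: scores equal to the mean are treated as negative.
    return [1 if score > mean_f1 else 0 for score in f1_scores]


# Hypothetical F1 scores for four rewrites of the same original question.
print(label_rewrites([0.0, 0.5, 1.0, 0.5]))  # prints [0, 0, 1, 0]
```

Labeling relative to the per-question mean turns answer selection into a binary classification task even when absolute F1 scores vary widely between questions.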
 EM, F1: token-level metrics comparing the gold answer with the model's answer

 TopHyp: the first reformulation among the seq2seq model outputs

 CNN: the best answer as chosen by the CNN-based selector
Analysis of the Agent's Language
Statistics of Questions
 Length: number of words in the question. TF (term frequency): number of repeated words in the question

 DF (document frequency): median frequency of the question's tokens in the contexts

 QC (Query Clarity): relative entropy between the question and the reformulation
gandhi deeply influenced count wrote war
Base-NMT Who influenced count wrote war?
What is name gandhi gandhi influence wrote peace peace?
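Two of the statistics above, Length and TF, can be sketched on the example rewrite. Reading TF as the mean term frequency (tokens divided by distinct tokens) is an assumption made here, as are the tokenization and the function names.

```python
import string
from collections import Counter


def tokenize(question):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(string.punctuation) for w in question.lower().split())
    return [t for t in tokens if t]


def question_stats(question):
    """Return the token count and the mean term frequency of a question."""
    tokens = tokenize(question)
    counts = Counter(tokens)
    mean_tf = len(tokens) / len(counts) if counts else 0.0
    return {"length": len(tokens), "mean_tf": mean_tf}


print(question_stats("What is name gandhi gandhi influence wrote peace peace?"))
```

For this rewrite there are 9 tokens and 7 distinct tokens, so the mean term frequency is above 1, reflecting the repeated "gandhi" and "peace".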
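Statistics of Questions

 Base-NMT

 The most syntactically well-formed questions

 Lower DF: because the NMT training corpus differs substantially from the SearchQA dataset

 AQA-QR: TopHyp

 99.8% begin with "what is name": the agent appears to have learned this because the answers are all name-related

 Less fluent

 Contains repeated tokens about twice as often as SearchQA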
Paraphrasing Quality
 Paraphrasing quality is tested on an image captioning dataset

 MSCOCO dataset

 For each image, one of the 5 captions is used as the source and the remaining 4 as references

 Base-NMT: 11.4 BLEU / AQA-QR: 8.6 BLEU
Reformulation Examples
Future work
 One-shot decision → sequential decision

 Information seeking task

 End-to-end RL problem

 Closed loop between reformulator and selector
Thank you

