A Comparative Study on Schema-guided
Dialog State Tracking
NAACL 2021
Jie Cao, Yi Zhang
Outline
● Motivation
● Task Description and Datasets
● Three Comparative Studies
i. Encoder Architectures
ii. Supplementary Training
iii. Impact of Schema Description Style
● Q&A
Motivation
Challenges of Virtual Assistants (Task-oriented)
● An increasing number of new services and APIs → (new annotation, new model retraining)
● Heterogeneous interfaces for similar services require precisely understanding overlapping functionalities.
● How to integrate common sense and world knowledge?
Schema-guided Dialogue State Tracking
Using natural language descriptions to explain the functionalities of tags helps generalize to unseen tags in unseen domains.
Adding intent descriptions (Flight Service 1 vs. Flight Service 2):
{
  "name": "SearchFlight",
  "description": "Find a flight itinerary between cities for a given date",
  "required_slots": …
},
{
  "name": "FindFlight",
  "description": "Search for flights to a destination",
  "required_slots": …
},
Adding slot descriptions, types, and possible values:
{
  "name": "num_stops",
  "description": "Number of layovers in the flight",
  "is_categorical": true,
  "possible_values": ["0", "1", "2"]
},
{
  "name": "is_direct",
  "description": "Whether the flight arrives directly without any stop",
  "is_categorical": true,
  "possible_values": ["true", "false"]
},
Schema-guided Dialogue State Tracking
Given a service schema with descriptions and the dialogue history,
predict the dialog state after each user turn.
Schema-guided Dialogue State Tracking
Two Datasets:
1. Google SG-DST
2. MultiWOZ 2.2
Four Subtasks Each Turn:
1. Active Intent Classification
2. Requested Slot
3. Categorical Slot Values
a. Boolean
b. Predefined Values
i. Numeric
ii. Text
4. Non-Categorical Slot Values
a. Span-based Value
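As a concrete illustration of what the four subtasks jointly predict, here is a minimal sketch of one turn's dialog state; the intent and slot names are hypothetical, loosely following the flight-service schema shown earlier, not taken from the datasets.

```python
# Hypothetical dialog state after one user turn; the four keys map to the
# four per-turn subtasks listed above.
state = {
    "active_intent": "SearchFlight",      # 1. active intent classification
    "requested_slots": ["num_stops"],     # 2. requested slots
    "slot_values": {
        "is_direct": ["true"],            # 3. categorical slot value (boolean)
        "destination": ["Seattle"],       # 4. non-categorical, span-based value
    },
}
```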
Datasets
SG-DST has more overlapping functionalities than MultiWOZ 2.2
Challenges: Three Comparative Studies
Q1: How to encode the dialog and schema?
● At each turn, the same dialog history is matched against all schema descriptions multiple times.
● Sentence-pair classification (SNLI) and token-level classification (QA)
Q2: How do different supplementary trainings help?
● Zero-shot learning for unseen services
Q3: How does the model perform on various description styles?
● Unseen services may have heterogeneous styles.
Q1: How to encode the dialog and schema? (Cross-Encoder)
Cons: a lot of recomputation, slow
a. The dialog is encoded multiple times within the same turn.
b. The schema is encoded multiple times across different turns.
Pros: accurate; each representation is contextualized via full attention.
Q1: How to encode the dialog and schema? (Dual-Encoder)
Pros: the dialog history and schema are encoded independently, so schema encodings can be precomputed once and cached. Fast inference.
Cons:
a. Only local self-attention
b. Less accurate
Q1: How to encode the dialog and schema? (Fusion-Encoder)
Pros: moderate inference speed.
Schema encodings can still be precomputed independently, with a thin full-attention fusion layer on top for better performance.
Cons: moderate accuracy.
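A minimal numpy sketch of this idea, assuming cached schema token embeddings and a single unparameterized attention step as the fusion layer; all names and dimensions are illustrative, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fusion_layer(dialog_tokens, schema_tokens):
    """Thin full-attention fusion: dialog token queries attend over
    precomputed (cached) schema token embeddings."""
    scores = dialog_tokens @ schema_tokens.T   # (n_dialog, n_schema)
    weights = softmax(scores, axis=-1)         # attention over schema tokens
    return weights @ schema_tokens             # fused dialog representations

# Schema token embeddings are computed once per service and reused each turn.
cached_schema = np.random.randn(6, 16)   # 6 schema tokens, dim 16
dialog = np.random.randn(10, 16)         # 10 dialog tokens this turn
fused = fusion_layer(dialog, cached_schema)
```

The key point is that only this thin fusion step runs per (turn, schema) pair; the expensive transformer passes over the schema are cached.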
Q1: How to encode the dialog and schema (for the 4 subtasks)?
By caching the token embeddings instead of a single CLS embedding, a simple partial-attention Fusion-Encoder achieves much better performance than the Dual-Encoder, while still running inference two times faster than the Cross-Encoder.
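The speed trade-off can be sketched as a back-of-the-envelope count of transformer forward passes; this is a simplification that ignores the fusion layer's own (small) cost, and the numbers are illustrative, not measurements from the paper.

```python
def cross_encoder_passes(num_turns, num_schema_elements):
    # Every (dialog, schema) pair is re-encoded jointly at every turn.
    return num_turns * num_schema_elements

def dual_or_fusion_encoder_passes(num_turns, num_schema_elements):
    # Schema elements are encoded once, cached, and reused across turns;
    # the dialog is encoded once per turn. (The Fusion-Encoder adds only a
    # thin attention layer per pair on top of these cached embeddings.)
    return num_schema_elements + num_turns

# e.g. a 10-turn dialog against a service with 20 intents/slots:
print(cross_encoder_passes(10, 20))           # 200 joint passes
print(dual_or_fusion_encoder_passes(10, 20))  # 30 independent passes
```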
Q2: How do different supplementary trainings help?
Standard pipeline: Pre-training Task → Fine-tuning on Target Task
With supplementary training: Pre-training Task → Supplementary Training on Intermediate Tasks (NLI, QA) with similar problem structures → Fine-tuning on Target Task
Intermediate-task models are available from model hubs, e.g. Hugging Face, ParlAI.
Q2: How do different supplementary trainings help?
● SNLI only helps for Intent (which emphasizes whole-sentence entailment), even though Req and Cat are also sentence-pair classification tasks.
● SQuAD consistently helps for the non-categorical slot identification tasks, due to their span-based retrieval.
● Supplementary training helps more on unseen services.
Q3: How does the model perform on various description styles?
Background:
1. To stay compatible with previous tag-based DST systems, many prior papers show that simply adding a question format to those tags may help.
a. Is a name-based description enough?
b. Does the question format help?
2. Unseen services may use different description styles.
a. Heterogeneous evaluation?
Q3: How does the model perform on various description styles? (homogeneous)
● Most names are meaningful and perform reasonably well, especially on the Intent/Req subtasks.
● Rich descriptions outperform name-based ones on NonCat, but are inconsistent on the other tasks.
Is a name-based description enough?
Q3: How does the model perform on various description styles? (homogeneous)
● The question format generally helps on Cat/NonCat.
● Adding it to rich descriptions benefits more from SQuAD2 supplementary training on unseen services, but not on MultiWOZ.
Is the question format helpful?
Q3: How does the model perform on various description styles? (heterogeneous)
● For unseen styles, all tasks suffer from inconsistencies, though to varying degrees.
● For paraphrased styles, richer descriptions are relatively more robust than name-based descriptions.
What if an unseen service uses a different description style?
Takeaways
1. Cross-Encoder > Fusion-Encoder > Dual-Encoder in accuracy, with the opposite ordering in inference speed.
2. To support low-resource unseen services, we quantified the gains from supplementary training on the different subtasks.
3. Simple name-based descriptions are actually meaningful and perform reasonably well, but they are not as robust as rich descriptions in most cases.
4. All subtasks suffer from inconsistencies when using heterogeneous descriptions on unseen services, which calls for future work on more robust cross-style schema-guided dialog modeling.
Q&A?
Thanks
