Taking the Road Less Travelled:
In pursuit of a Multi-modal
experience for Bixby
Samsung R&D Bangalore, India
Dr. Vikram Vij
vikram.v@samsung.com
Intelligent Assistants are fast emerging as the next breakthrough
user interface
1990s
Web
2000s
Apps
Today
Assistants
Images references form
Evolution of Human Computer Interface
GUI
(~1980s)
Touch UI
(~2000)
Voice
(2011)
Bixby
(2017)
Changes of Interface Paradigm
Voice Assistant Market Research Report
Global Forecast 2023
Reference : https://www.marketresearchfuture.com/reports/voice-assistant-market-4003
Bixby Introduction
Bixby is an intelligent, personalized voice interface for your phone.
It's multi-modal: it lets you seamlessly switch between voice and touch modes.
o Launch Date : 19th July 2017 (US), 22nd Aug (Global)
o Available in more than 200 countries
o More than 75 Domains supported (Camera, Gallery, Messages, WhatsApp, Youtube, Uber etc. )
o More than 27 million registered users
http://bixby.samsung.com/meet-bixby
https://www.youtube.com/watch?v=dbmVtseEjo4&index=1&list=PLrV44rSVouDcbvky1f77mUjWLCq8WI-Z1
https://www.youtube.com/watch?v=Gcd4NpK2fTI
Bixby Live Demo
Bixby Overview
Supporting every task of
the application
Understanding the current
context and state of app
Find an
umbrella photo
Manual editing
VOICE
TOUCH
VOICE
1
2
3
Understanding commands
with incomplete info
Send this photo
via message
To whom?
To Jane
Done
Incomplete Command
A true one click action
- Turn on
- Authenticate
- Unlock
- Wake the phone
- Execute the command
Supporting Samsung's
native apps

Request
incomplete.
Error
Show me the Wi-Fi data
usage
Press &
Hold
Bixby is fundamentally different from other voice agents or
assistants in the market because of its ..
Post it on
Instagram
Completeness Context Awareness Cognitive Tolerance Frictionless
Bixby - Cognitive Tolerance
ASR: Incomplete or inaccurate instructions are also performed under the context.
Bixby | Human Computer Interface Revolution
With English Support, Samsung's Bixby Impresses Vs. Siri And
Google Assistant
Bixby is perhaps in the most precarious spot, as it's going to be
competing directly against Google Assistant on some devices. Bixby's
capabilities sound quite impressive thanks to its integration with
other Samsung apps
Galaxy S8's voice sidekick can do things Siri can't
Bixby v1.0: Minimalistic View
ASR
NLU
voice packet
text input
command
ASR
ASR: Automatic Speech Recognition
NLU: Natural Language Understanding
Traditional NLU Flow
NLU
Platform
mom
Text to Mom Machine Learning Models
Command
Domain
Classifier
Intent
Classifier
Slot
Tagger
Messages Send Message Mom
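The three-stage flow on this slide (domain classifier, then intent classifier, then slot tagger) can be sketched in a few lines. This is a minimal illustration only: the keyword tables, the `recipient` slot name, and the `understand` function are hypothetical stand-ins for the trained machine learning models the slide refers to.

```python
from dataclasses import dataclass, field


@dataclass
class Command:
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)


# Hypothetical keyword tables standing in for trained classifiers.
DOMAIN_KEYWORDS = {
    "Messages": ("text", "message"),
    "Gallery": ("photo", "picture"),
}
INTENT_KEYWORDS = {
    "Messages": {"Send Message": ("text", "send")},
    "Gallery": {"Find Photo": ("find", "show")},
}


def classify_domain(utterance):
    words = utterance.lower().split()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(k in words for k in keywords):
            return domain
    return "Unknown"


def classify_intent(domain, utterance):
    words = utterance.lower().split()
    for intent, keywords in INTENT_KEYWORDS.get(domain, {}).items():
        if any(k in words for k in keywords):
            return intent
    return "Unknown"


def tag_slots(utterance):
    # Toy slot tagger: everything after "to" is treated as the recipient.
    words = utterance.split()
    lowered = [w.lower() for w in words]
    if "to" in lowered:
        idx = lowered.index("to")
        if idx + 1 < len(words):
            return {"recipient": " ".join(words[idx + 1:])}
    return {}


def understand(utterance):
    domain = classify_domain(utterance)
    intent = classify_intent(domain, utterance)
    return Command(domain, intent, tag_slots(utterance))


print(understand("Text to Mom"))
```

Each stage narrows the decision made by the previous one, which is why an early domain error cannot be repaired by the later stages in this flow.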
Key Challenges
Design
o Text and Voice : Co-existence of Dual Modality
o Representation of Massive Input Space
o Management of Massive Context
o Handling of Variable Output Space
o Design of Deep Learning Architecture to Achieve this
Data
o Managing the distribution and variations of data
o Balance of Data to maintain the expected distribution of data across different
classes
o Special handling for rejection Data
Bixby: The Multi-Modal Point of View
 Home > Settings > Connections > Data Usage
Touch Interface Voice Interface
+
Show me the mobile data usage
Bixby: The Multi-Modal Point of View (contd.)
Touch UI
Screen Flow
Voice UI
Find Hawaii photos in Gallery
Context Context Context Context
find James in Contacts application => contact information of James
find James in Gallery application => images tagged as James
Leap Required for NLU toward Multi-Modality
Traditional
NLU
Multi-Modal
NLU
Context
Awareness
Massive Number of Contexts Varying Set of Commands





Thousands of states
Note8 





S8
TabS
Various device models,
apps, locales,
Input Space = (2,000 Contexts) x (Utterances for 6,000 commands)
Challenge of Massive Contextual Input Space
Find James+
Picture View Context
Find James+
Contact View Context
James Picture
James Contact

Static
Classifier
Static
Classifier
Static
Classifier
Static
Classifier






6000+ command classes
Context Space
2000+ contexts
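A quick calculation shows why one static classifier per context does not scale. The context and command counts come from the slide; the number of utterances per command is an assumed figure used only for illustration.

```python
NUM_CONTEXTS = 2_000          # from the slide
NUM_COMMANDS = 6_000          # from the slide
UTTERANCES_PER_COMMAND = 10   # assumption, not from the slides


def input_space_size(contexts, commands, utterances_per_command):
    """Number of distinct (context, utterance) inputs to cover."""
    return contexts * commands * utterances_per_command


context_command_pairs = input_space_size(NUM_CONTEXTS, NUM_COMMANDS, 1)
total_inputs = input_space_size(NUM_CONTEXTS, NUM_COMMANDS, UTTERANCES_PER_COMMAND)

print(f"{context_command_pairs:,} context-command pairs")
print(f"{total_inputs:,} inputs at {UTTERANCES_PER_COMMAND} utterances per command")
```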
Deep Learning was chosen instead of SVMs, Random Forest etc.
 Massive number of Classes
 Approximately 60 Classes for Domains
 Approximately 6K Classes for Intents
 Closeness of Domains
 The nature of the classes is similar
 Examples: Reminder, Calendar and Clock
 Huge Data
 10M data for Domain Classification
 1.5M data per Intent Classification (on average per Domain)
Motivation for Deep Learning
Domain
Classification
Intent
Classification
Slot Tagger
Utterance
 
 
Slots
Domain Label
Intent Label
Approach for Massive Contextual Input Space
Context-conditioned DNN classifier + Sampling
Context-Aware
DNN Classifier
Sampling
6000+ commands
Context + Utterance
context_α + utterance_b → command_1
context_α + utterance_c → command_2

context_α + utterance_a → command_1
context_β + utterance_b → command_2
context_β + utterance_c → command_2

context_β + utterance_a → command_1



Training Set
Input Output
Hierarchical classifier
Session based architecture
Rejection Logic in Intent
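One way to read "context-conditioned classifier + sampling" is that the context becomes part of the classifier input, and the training set is a sample of the (context, utterance) combinations rather than the full cross product. The sketch below assumes that reading. The context names, command names, and the `<context>` prefix encoding are hypothetical.

```python
import random

# Hypothetical labels: the same utterance maps to different commands
# depending on the screen (context) it was spoken on.
LABELS = {
    ("picture_view", "find james"): "search_photos_by_person",
    ("contact_view", "find james"): "open_contact",
    ("picture_view", "share this"): "share_photo",
    ("contact_view", "share this"): "share_contact",
}


def encode(context, utterance):
    """Prepend the context as a token so one model can condition on it."""
    return f"<{context}> {utterance}"


def sample_training_set(labels, n, seed=0):
    """Draw n (input, command) pairs instead of enumerating every combination."""
    rng = random.Random(seed)
    keys = sorted(labels)
    chosen = rng.sample(keys, min(n, len(keys)))
    return [(encode(ctx, utt), labels[(ctx, utt)]) for ctx, utt in chosen]


for model_input, command in sample_training_set(LABELS, 3):
    print(model_input, "->", command)
```

A single classifier trained on such pairs replaces the bank of per-context static classifiers shown on the previous slide.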
 RNN word model had difficulty in:
 Handling unknowns (word misspellings)
 Learning word inflections (word boundary going beyond representation)
 State based learning
 So switched to CNN character model
Challenge of RNN vs CNN
~~~ utt ~~
~~~ utt ~~
.
.
.
~~~ utt ~~
vs
e.g. "search for s8 plus" goes to the calculator domain
e.g. on the Settings Bluetooth screen: "turn off please"
Issue: state is not learnt ("Wi-Fi off" is detected)
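The misspelling problem can be illustrated without any neural network. A word-level vocabulary treats a misspelled word as a completely different token, while character n-grams still share most of their pieces. The sketch below uses plain character trigram overlap, not a CNN; it only shows the kind of sub-word evidence a character-level model has access to.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, with '#' marking the word boundaries."""
    padded = f"#{text.lower()}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def jaccard(a, b):
    """Overlap between two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Word level: the misspelling is simply an unknown token.
word_level = jaccard({"bluetooth"}, {"bluetoth"})

# Character level: most trigrams survive the misspelling.
char_level = jaccard(char_ngrams("bluetooth"), char_ngrams("bluetoth"))

print(word_level, char_level)
```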
 Determining the Optimal Filter Size
 Smaller filter size used for sub-word level features
 Larger filter size used for understanding language structures
Challenge of CNN Filter Size
Multiple filters with various sizes work in parallel
Final layer of CNN which gives best output
Reference : hackerearth.com
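The idea of running filters of several widths in parallel can be sketched in pure Python. Here each filter is a hand-set character pattern, and its response at a position is the number of matching characters, which is what a dot product of one-hot character encodings would give. Real filters are learned and real-valued; the patterns below are hypothetical.

```python
def conv_max_pool(text, pattern):
    """Slide a filter of width len(pattern) over text and keep the best response."""
    k = len(pattern)
    if len(text) < k:
        return 0
    return max(
        sum(a == b for a, b in zip(text[i:i + k], pattern))
        for i in range(len(text) - k + 1)
    )


# Narrow filters pick up sub-word cues, wide ones pick up phrase structure.
FILTERS = ["off", "wifi", "turn on", "bluetooth"]  # widths 3, 4, 7, 9


def features(text):
    """One max-pooled, width-normalised response per filter."""
    lowered = text.lower()
    return [conv_max_pool(lowered, p) / len(p) for p in FILTERS]


print(features("turn off wifi"))
```

The resulting feature vector, one value per filter, is what the final classification layer would consume.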
Challenge of Variable Output Space
App Version, Device Models, Locale
India V 1.1



Turn on Bluetooth tethering
Turn on USB tethering
Turn on tethering
Note8 





S8
TabS
Model A
Model B
Approach for Variable Output Space
Version Management Mechanism for NLU Engine
Note 8
Country
Installed app info
OS version
Version Metadata

Version mask vectors
V1 





V2
V3
Device
Server
Version DB
NLU Core
Command
Classification
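A version mask vector can be thought of as a 0/1 vector over the command classes that hides commands a given device, locale, or app version does not support. The sketch below assumes the mask is applied to classifier scores before picking the best command. The version names follow the slide; the mask contents and scores are hypothetical.

```python
COMMANDS = [
    "turn_on_tethering",
    "turn_on_usb_tethering",
    "turn_on_bluetooth_tethering",
]

# Hypothetical masks: 1 means the command exists in that version.
VERSION_MASKS = {
    "V1": [1, 0, 0],
    "V2": [1, 1, 0],
    "V3": [1, 1, 1],
}


def classify(scores, version):
    """Return the best-scoring command among those the version supports."""
    mask = VERSION_MASKS[version]
    allowed = [(s, c) for s, c, m in zip(scores, COMMANDS, mask) if m]
    if not allowed:
        return None
    return max(allowed)[1]


scores = [0.2, 0.3, 0.5]  # hypothetical classifier output
for version in ("V1", "V2", "V3"):
    print(version, classify(scores, version))
```

Keeping the mask outside the model means one classifier can serve every device and app version, which matches the slide's point about abstracting the variable output space away from the NLU core.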
Key Learnings - Design
 Need to experiment with various DNN architectures & parameters; make
sure experiments have a rationale
 The obvious choice of DNN may not work best: RNNs are typically used for text,
but CNNs proved to be better
 Hierarchical design may work better (e.g. text classification)
 Feature based matching for intent classes where 100% accuracy is needed
 Rule-Based Matching of NER instead of ML/DL based NER
 Rejection Based Intent Classification for Close Domains
 Can abstract out complexity where possible (e.g. variable output space)
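Rejection for close domains such as Reminder, Calendar and Clock can be approximated with a confidence threshold: if no class is clearly ahead, the utterance is rejected rather than routed to a near-miss domain. The slides do not say how rejection was implemented, so the softmax threshold below is an assumption, and the logits are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_or_reject(logits, labels, threshold=0.6):
    """Return the top label, or 'REJECT' when its probability is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "REJECT"


LABELS = ["Reminder", "Calendar", "Clock"]
print(classify_or_reject([4.0, 0.5, 0.2], LABELS))  # one class clearly ahead
print(classify_or_reject([1.0, 0.9, 0.8], LABELS))  # classes too close to call
```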
Massive Data Flow
Synthetic
Generation of Data
Purchased (3rd Party )
Data
Crawled Data for
Out of Domain
Voice of Customer Data
Quick Grammar Data
DC
Bucketed and annotated
for Single Intent Class
DC and Intent Separated
by Class Levels
Bucketed by Single Intent
Class
Special Data
Market Issues & Bug Fixes
for Intent and Domain
Sampled 2K/Class
Hand-cleaned and Consumed Total
Hand-cleaned & Down Sampled
Sampled 2K/Class
Service API Layer
Intent Slot
Sampled 10- ~ 20K/Class
Sampled 10- ~ 20K/Class
Hand-cleaned & Down Sampled
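The "Sampled 2K/Class" steps in this flow amount to capping the number of examples per class so that large classes do not dominate training. A minimal sketch, assuming uniform random sampling within each class (the slides do not state the sampling strategy):

```python
import random
from collections import defaultdict


def downsample_per_class(examples, cap, seed=0):
    """Keep at most `cap` examples per label; smaller classes are kept whole."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in examples:
        by_class[label].append(text)

    balanced = []
    for label in sorted(by_class):
        texts = by_class[label]
        kept = texts if len(texts) <= cap else rng.sample(texts, cap)
        balanced.extend((t, label) for t in kept)
    return balanced


# Toy data with a cap of 3 instead of 2,000.
data = [(f"utterance {i}", "Clock") for i in range(5)]
data += [(f"utterance {i}", "Reminder") for i in range(2)]
print(downsample_per_class(data, cap=3))
```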
Data Governance - Training Data
Used Tools to detect & resolve data conflicts across
Domains & Intents
 TF-IDF based tool
 Cosine similarity based tool
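A conflict detector of this kind can be sketched as follows: vectorise each utterance with TF-IDF, then flag pairs that are nearly identical by cosine similarity but carry different labels. This is an illustration of the general technique, not the in-house tool. The smoothed IDF formula, the threshold, and the sample utterances are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dict of token -> weight) with a smoothed IDF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def find_conflicts(examples, threshold=0.8):
    """Pairs of near-duplicate utterances that have different labels."""
    vecs = tfidf_vectors([text for text, _ in examples])
    conflicts = []
    for i in range(len(examples)):
        for j in range(i + 1, len(examples)):
            different = examples[i][1] != examples[j][1]
            if different and cosine(vecs[i], vecs[j]) >= threshold:
                conflicts.append((examples[i], examples[j]))
    return conflicts


examples = [
    ("set an alarm for 7 am", "Clock"),
    ("set an alarm for 7 am", "Reminder"),
    ("show me hawaii photos", "Gallery"),
]
print(find_conflicts(examples))
```

The pairwise loop is quadratic, so at the data volumes mentioned in the slides a real tool would need blocking or an index; that part is left out here.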
Data Governance - Test Data
Unit Testing Automation E2E Testing Automation
In- House Automated
Unit Test Tool for
Domain , Intent and
Slot
DEV
Server
Accepted ? Accepted ?
STG
Server
Accepted ?
PRD
Server
Development and
Management of Data
Analysis based on Data Governance Tool
Y Y Y
N N N
End User
VOC Issues
Key Learnings - Data
 Managing the distribution and variations of data is essential
 Quality of Data is critical
o Balance of Data to maintain the expected distribution of data across different classes
o Special handling for rejection Data
 A Deep Learning Engineer / Data Scientist must spend 30% of their time
looking at the data
 People are needed to manage this volume of data
 Tools / Automation need to be developed for pre-processing of data
 We cannot avoid hand-cleaning or hand-engineering of data
 Obvious need for Data Governance as well as Continuous Monitoring of product
quality.
 The NLP / ML driven project cycle (including data) is quite different from
conventional SW project cycle
Samsung voice intelligence.v5.5
ASR: Challenge of Speech
Is different for every speaker
May be fast, slow, or varying in speed
May have high pitch, low pitch, or be whispered
Has widely-varying types of environmental noise
Changes depending on sequence of phonemes
Changes depending on speaking style
May not have distinct boundaries between units
Changes depending on the semantics of the utterance
Has an unlimited number of words
Bixby ASR - Fundamentals
Language
Model(s)
voice packet
Feature
Extraction Decoder
Acoustic
Model(s)
ASR
System
ASR
Hypothesis
Inverse
Text
Normalization
 Acoustic Model
 Links Acoustics to Word/phoneme sequence
 Estimates the likelihood of acoustic sequence given a
word/phoneme (LSTM)
 Language Model
 Prior on word sequences
 Probability of a word given the preceding words (n-gram)
 Decoder
 Find the best word sequence, i.e. searching for the
lowest-cost path in a graph
 Uses Viterbi algorithm (dynamic programming)
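The decoder's search for the lowest-cost path can be illustrated with a small Viterbi implementation. Costs are negative log probabilities, so the acoustic score and the language-model score simply add along a path. The three-word vocabulary, the observation symbols `x1` and `x2`, and all probabilities are made up for this sketch; a production decoder searches a far larger graph.

```python
import math


def nll(table):
    """Turn a nested dict of probabilities into negative log costs."""
    return {k: {k2: -math.log(p) for k2, p in row.items()} for k, row in table.items()}


def viterbi(observations, states, start_cost, trans_cost, emit_cost):
    """Return (cost, path) of the cheapest state sequence for the observations."""
    best = {
        s: (start_cost[s] + emit_cost[s][observations[0]], [s]) for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Cheapest way to arrive in state s from any previous state.
            cost, path = min(
                ((best[p][0] + trans_cost[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (cost + emit_cost[s][obs], path + [s])
        best = new_best
    return min(best.values(), key=lambda item: item[0])


STATES = ["turn", "on", "own"]
START = {s: -math.log(p) for s, p in {"turn": 0.8, "on": 0.1, "own": 0.1}.items()}
# Bigram-style language model: "turn on" is far more likely than "turn own".
TRANS = nll({
    "turn": {"turn": 0.1, "on": 0.8, "own": 0.1},
    "on": {"turn": 0.4, "on": 0.3, "own": 0.3},
    "own": {"turn": 0.4, "on": 0.3, "own": 0.3},
})
# Acoustic model: the second frame sounds slightly more like "own".
EMIT = nll({
    "turn": {"x1": 0.9, "x2": 0.1},
    "on": {"x1": 0.4, "x2": 0.6},
    "own": {"x1": 0.3, "x2": 0.7},
})

cost, path = viterbi(["x1", "x2"], STATES, START, TRANS, EMIT)
print(path, round(cost, 4))
```

In this toy case the acoustic evidence alone slightly favours "own" for the second frame, but the language-model prior pulls the best path to "turn on", which is the interplay the three bullets describe.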
Bixby ASR - Fundamentals
Bixby ASR - Multi Accent
United States China
India
United Kingdom
Spain
South Korea
DEFAULT ACCENTED
On-Boarding
Utterances
SIM Card
Information
Keyboard
Language
Contact
Details
Accent Determination
Based on:
Australia
Canada
Challenge for Indian Market
 Hindi targeted as language of experimentation.
 Indian languages, e.g. Hindi, are used in conjunction with English
e.g. camera 爐爛爐迦ぞ 爐爐萎
 We have developed a bilingual (English + Hindi) model for the Hindi classifier
