Taking the Road Less Travelled:
In pursuit of a Multi-modal
experience for Bixby
Samsung R&D Bangalore, India
Dr. Vikram Vij
vikram.v@samsung.com

Intelligent Assistants are fast emerging as the next breakthrough
user interface
1990s
Web
2000s
Apps
Today
Assistants

Evolution of Human Computer Interface
GUI
(~1980s)
Touch UI
(~2000)
Voice
(2011)
Bixby
(2017)
Changes of Interface Paradigm
Voice Assistant Market Research Report
Global Forecast 2023
Reference : https://www.marketresearchfuture.com/reports/voice-assistant-market-4003

Bixby Introduction
Bixby is an intelligent, personalized voice interface for your phone.
It is multi-modal: it lets you seamlessly switch between voice and touch modes.
o Launch Date : 19th July 2017 (US), 22nd Aug (Global)
o Available in more than 200 countries
o More than 75 Domains supported (Camera, Gallery, Messages, WhatsApp, YouTube, Uber etc.)
o More than 27 million registered users
http://bixby.samsung.com/meet-bixby
https://www.youtube.com/watch?v=dbmVtseEjo4&index=1&list=PLrV44rSVouDcbvky1f77mUjWLCq8WI-Z1
https://www.youtube.com/watch?v=Gcd4NpK2fTI

Bixby Overview
Bixby is fundamentally different from other voice agents or
assistants in the market because of its:
Completeness
- Supporting every task of the application
- Supporting Samsung’s native apps
- “Show me the Wi-Fi data usage”
Context Awareness
- Understanding the current context and state of app
- 1 VOICE: Find an umbrella photo
- 2 TOUCH: Manual editing
- 3 VOICE: Post it on Instagram
Cognitive Tolerance
- Understanding commands with incomplete info
- “Incomplete Command”: Send this photo via message / To whom? / To Jane / Done
- Request incomplete. Error
Frictionless
- A true one click action (Press & Hold)
- Turn on
- Authenticate
- Unlock
- Wake the phone
- Execute the command

Bixby - Cognitive Tolerance
ASR
Incomplete or inaccurate instructions are still carried out based on the context.

Bixby | Human Computer Interface Revolution
With English Support, Samsung's Bixby Impresses Vs. Siri And
Google Assistant
Bixby is perhaps in the most precarious spot, as it’s going to be
competing directly against Google Assistant on some devices. Bixby’s
capabilities sound quite impressive thanks to its integration with
other Samsung apps
Galaxy S8's voice sidekick can do things Siri can't

Bixby v1.0: Minimalistic View
voice packet → ASR → text input → NLU → command
ASR: Automatic Speech Recognition
NLU: Natural Language Understanding

Traditional NLU Flow
“Text to Mom” → NLU Platform (Machine Learning Models) → Command
Domain Classifier → Messages
Intent Classifier → Send Message
Slot Tagger → “Mom”
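
The three-stage flow above (domain, then intent, then slots) can be sketched as follows. This is a toy illustration only: the keyword tables stand in for the trained machine learning models, and every name here (`DOMAIN_KEYWORDS`, `INTENT_KEYWORDS`, `understand`, the `recipient` slot) is a hypothetical choice, not part of Bixby.

```python
from dataclasses import dataclass, field

# Hypothetical keyword tables standing in for trained classifiers.
DOMAIN_KEYWORDS = {
    "Messages": {"text", "message"},
    "Camera": {"photo", "camera"},
}
INTENT_KEYWORDS = {
    "Messages": {"Send Message": {"text", "send"}},
    "Camera": {"Take Picture": {"take", "photo"}},
}


@dataclass
class Command:
    domain: str
    intent: str
    slots: dict = field(default_factory=dict)


def classify_domain(tokens):
    # Pick the domain whose keyword set overlaps most with the utterance.
    return max(DOMAIN_KEYWORDS, key=lambda d: len(DOMAIN_KEYWORDS[d] & set(tokens)))


def classify_intent(domain, tokens):
    # Intent classification is run only inside the chosen domain.
    intents = INTENT_KEYWORDS[domain]
    return max(intents, key=lambda i: len(intents[i] & set(tokens)))


def tag_slots(tokens):
    # Toy slot tagger: the token following "to" is treated as the recipient.
    slots = {}
    if "to" in tokens:
        position = tokens.index("to")
        if position + 1 < len(tokens):
            slots["recipient"] = tokens[position + 1]
    return slots


def understand(utterance):
    tokens = utterance.lower().split()
    domain = classify_domain(tokens)
    intent = classify_intent(domain, tokens)
    return Command(domain, intent, tag_slots(tokens))
```

Calling `understand("Text to Mom")` walks the same path as the slide: domain Messages, intent Send Message, and the recipient slot filled from the utterance.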

Key Challenges
Design
o Text and Voice: Co-existence of Dual Modality
o Representation of Massive Input Space
o Management of Massive Context
o Handling of Variable Output Space
o Design of Deep Learning Architecture to Achieve this
Data
o Managing the distribution and variations of data
o Balance of Data to maintain the expected distribution of data across different classes
o Special handling for rejection Data

Bixby: The Multi-Modal Point of View
① Home ② Settings ③ Connections ④ Data Usage
Touch Interface Voice Interface
+
“Show me the mobile data usage”

Bixby: The Multi-Modal Point of View (cont’d)
Touch UI
Screen Flow
Voice UI
“Find Hawaii photos in Gallery”
Context Context Context Context
“find James” in Contacts application => contact information of James
“find James” in Gallery application => images tagged as James
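
A minimal sketch of the "find James" example: the same utterance resolves to a different action depending on the application context. The table `ACTIONS_BY_CONTEXT` and the action names are assumptions made up for illustration.

```python
# Hypothetical mapping from (application context, verb) to an action name.
ACTIONS_BY_CONTEXT = {
    ("Contacts", "find"): "show_contact_information",
    ("Gallery", "find"): "show_tagged_images",
}


def resolve(context, utterance):
    # Split "find James" into the verb and its argument.
    verb, _, argument = utterance.strip().partition(" ")
    action = ACTIONS_BY_CONTEXT.get((context, verb.lower()))
    if action is None:
        return None  # the verb has no meaning in this context
    return action, argument
```

With this table, `resolve("Contacts", "find James")` and `resolve("Gallery", "find James")` return different actions for identical text, which is the behaviour a context-free NLU cannot express.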

Leap Required for NLU toward Multi-Modality
Traditional
NLU
Multi-Modal
NLU
Context
Awareness
Massive Number of Contexts Varying Set of Commands
…
…
…
…
…
Thousands of states
Note8 …
…
…
…
…
…
S8
TabS
Various device models,
apps, locales, …

Challenge of Massive Contextual Input Space
Input Space = (2,000 Contexts) x (Utterances for 6,000 commands)
“Find James”+
Picture View Context
“Find James”+
Contact View Context
James’ Picture
James’ Contact
…
Static
Classifier
Static
Classifier
Static
Classifier
Static
Classifier
…
…
…
…
…
…
6000+ command classes
Context Space
2000+ contexts

Motivation for Deep Learning
Deep Learning was chosen instead of SVMs, Random Forest etc.
• Massive number of Classes
  • Approximately 60 Classes for Domains
  • Approximately 6K Classes for Intents
• Closeness of Domains
  • The classes are similar in nature
  • Examples: Reminder, Calendar and Clock
• Huge Data
  • 10M data for Domain Classification
  • 1.5M data per Intent Classification (on average per Domain)
Domain
Classification
Intent
Classification
Slot Tagger
Utterance
… …
… …
Slots
Domain Label
Intent Label

Approach for Massive Contextual Input Space
Context-conditioned DNN classifier + Sampling
Context-Aware
DNN Classifier
Sampling
6000+ commands
Context + Utterance
context_α + utterance_b → command_1
context_α + utterance_c → command_2
…
context_α + utterance_a → command_1
context_β + utterance_b → command_2
context_β + utterance_c → command_2
…
context_β + utterance_a → command_1
…
…
…
Training Set
Input Output
Hierarchical classifier
Session based architecture
Rejection Logic in Intent
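
One way the "Context + Utterance → command" training set with sampling could be assembled is sketched below. The slides do not describe the sampling strategy, so uniform sampling per context is an assumption, as are the function names and the `<ctx:...>` token used to condition the input on the context.

```python
import random


def sample_training_set(examples, per_context, seed=0):
    """Keep at most `per_context` (context, utterance, command) rows per context."""
    rng = random.Random(seed)
    by_context = {}
    for context, utterance, command in examples:
        by_context.setdefault(context, []).append((context, utterance, command))
    sampled = []
    for context in sorted(by_context):
        rows = by_context[context]
        # Assumption: uniform sampling without replacement inside each context.
        sampled.extend(rng.sample(rows, min(per_context, len(rows))))
    return sampled


def encode(context, utterance):
    # Assumption: the context is fed to the classifier as a leading special token.
    return f"<ctx:{context}> {utterance}"
```

A single classifier trained on `encode(context, utterance)` inputs replaces the thousands of per-context static classifiers, while sampling keeps the 2,000 x 6,000 space from being enumerated in full.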

Challenge of RNN vs CNN
• RNN word model had difficulty in:
  • Handling unknowns (word misspellings)
  • Learning word inflections (word boundary going beyond representation)
  • State based learning
• So switched to CNN character model
[Figure: utterance inputs, RNN vs CNN]
e.g. “search for s8 plus” goes to calculator domain
e.g. Settings Bluetooth Screen : “turn off please”
Issue : State is not learnt (Wifi off is detected)
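
Why character-level input tolerates misspellings can be illustrated without a neural network at all. The sketch below is not a CNN; it only compares character trigrams, which is an assumed stand-in for the sub-word features a character model can pick up. A word-level vocabulary treats a misspelling as unknown, while most of its character trigrams are still shared with the correct word.

```python
def char_ngrams(word, n=3):
    # Pad with '#' so that word boundaries produce their own n-grams.
    padded = f"#{word}#"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def overlap(a, b, n=3):
    # Jaccard overlap between the character n-gram sets of two words.
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a | grams_b)


WORD_VOCABULARY = {"bluetooth", "camera"}  # hypothetical word-level vocabulary
```

Here `"bluetoth" in WORD_VOCABULARY` is false, so a word model sees an unknown token, yet `overlap("bluetooth", "bluetoth")` remains high because seven of the trigrams are identical.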

Challenge of CNN Filter Size
• Determining the Optimal Filter Size
  • Smaller filter size used for sub-word level features
  • Larger filter size used for understanding language structures
Multiple filters with various sizes work in parallel
Final layer of CNN which gives best output
Reference : hackerearth.com
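
The idea of filters with different widths running in parallel can be sketched in plain Python. This is a simplified illustration under stated assumptions: the input is a sequence of scalars rather than character embeddings, the kernels are fixed rather than learned, and max-over-time pooling is assumed as the way each filter is reduced to one feature.

```python
def conv1d_max(sequence, kernel):
    """Slide `kernel` over `sequence` and return the largest response."""
    width = len(kernel)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the kernel")
    responses = [
        sum(x * w for x, w in zip(sequence[i:i + width], kernel))
        for i in range(len(sequence) - width + 1)
    ]
    return max(responses)  # max-over-time pooling


def multi_width_features(sequence, kernels):
    # Each kernel has its own width; all of them see the same input in parallel.
    return [conv1d_max(sequence, kernel) for kernel in kernels]
```

Narrow kernels respond to short, sub-word patterns and wide kernels to longer spans; concatenating their pooled outputs gives the final layer both views of the utterance.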

Challenge of Variable Output Space
App Version    Device Models    Locale
India V 1.1
…
…
…
Turn on Bluetooth tethering
Turn on USB tethering
Turn on tethering
Note8 …
…
…
…
…
…
S8
TabS
Model A
Model B

Approach for Variable Output Space
Version Management Mechanism for NLU Engine
Note 8
Country
Installed app info
OS version
Version Metadata
…
Version mask vectors
V1 …
…
…
…
…
…
V2
V3
Device
Server
Version DB
NLU Core
Command
Classification
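
A version mask vector can be applied to the classifier output so that only commands available on the current device are eligible. The sketch below is an assumption about how such a mask might be used; the mask values, the keys `model_a` and `model_b`, and the scores are invented for illustration and do not come from the Version DB on the slide.

```python
COMMANDS = [
    "Turn on Bluetooth tethering",
    "Turn on USB tethering",
    "Turn on tethering",
]

# Hypothetical masks: 1 means the command exists for that device/app version.
VERSION_MASKS = {
    "model_a": [1, 1, 0],
    "model_b": [0, 0, 1],
}


def classify_with_mask(scores, mask):
    """Return the best-scoring command among those the mask allows."""
    best_index, best_score = None, float("-inf")
    for index, (score, allowed) in enumerate(zip(scores, mask)):
        if allowed and score > best_score:
            best_index, best_score = index, score
    return None if best_index is None else COMMANDS[best_index]
```

The NLU core keeps one fixed output space, and the per-version differences are pushed into the mask, which is what lets the variable output space be abstracted away from the model.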

Key Learnings - Design
• Need to experiment with various DNN Architectures & parameters – make
sure experiments have a rationale
• The obvious choice of DNN may not work best: RNNs are typically used for
text, but CNNs proved to be better
• Hierarchical design may work better (e.g. text classification)
• Feature based matching for intent classes where 100% accuracy is needed
• Rule-Based Matching of NER instead of ML/DL based NER
• Rejection Based Intent Classification for Close Domains
• Can abstract out complexity where possible (e.g. variable output space)
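
Rejection-based intent classification for close domains could look like the following sketch. The slides do not give the rejection rule, so a simple confidence threshold is assumed here, and the threshold value and label names are hypothetical.

```python
def classify_or_reject(probabilities, threshold=0.6):
    """Return the most probable intent, or None when confidence is too low."""
    if not probabilities:
        return None
    best = max(probabilities, key=probabilities.get)
    # Assumption: a low top probability means the utterance likely belongs
    # to a neighbouring domain (e.g. Reminder vs Calendar vs Clock).
    return best if probabilities[best] >= threshold else None
```

A rejected utterance can then be handed to another domain or answered with a clarification, instead of being forced into the closest wrong intent.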

Massive Data Flow
Synthetic
Generation of Data
Purchased (3rd Party )
Data
Crawled Data for
Out of Domain
Voice of Customer Data
Quick Grammar Data
DC
Bucketed and annotated
for Single Intent Class
DC and Intent Separated
by Class Levels
Bucketed by Single Intent
Class
Special Data
Market Issues & Bug Fixes
for Intent and Domain
Sampled 2K/Class
Hand-cleaned and Consumed Total
Hand-cleaned & Down Sampled
Sampled 2K/Class
Service API Layer
Intent Slot
Sampled 10K ~ 20K/Class
Sampled 10K ~ 20K/Class
Hand-cleaned & Down Sampled
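
The "Sampled 2K/Class" step amounts to capping the number of examples per class. A minimal sketch, assuming uniform random down-sampling (the slides do not say how rows are chosen) and a hypothetical function name:

```python
import random
from collections import defaultdict


def downsample_per_class(rows, cap, seed=0):
    """Keep at most `cap` (text, label) rows for every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append(text)
    balanced = []
    for label in sorted(by_label):
        texts = by_label[label]
        # Classes at or below the cap are kept whole; larger ones are sampled.
        kept = texts if len(texts) <= cap else rng.sample(texts, cap)
        balanced.extend((text, label) for text in kept)
    return balanced
```

For the slide's setting the cap would be 2,000 for domain and intent data, so that frequent classes do not dominate the distribution seen during training.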

Data Governance – Training Data
Used Tools to detect & resolve data conflicts across
Domains & Intents
• TF-IDF based tool
• Cosine similarity based tool
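
A conflict-detection tool of this kind can be sketched with TF-IDF vectors and cosine similarity: utterances that are nearly identical but carry different labels are flagged for review. The slide lists the two tools separately; combining them in one function, the smoothed IDF formula, and the 0.8 threshold are all assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(texts):
    docs = [text.lower().split() for text in texts]
    n = len(docs)
    document_frequency = Counter(token for doc in docs for token in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            token: (count / len(doc))
            * (math.log((1 + n) / (1 + document_frequency[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


def cosine(a, b):
    dot = sum(weight * b.get(token, 0.0) for token, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_conflicts(rows, threshold=0.8):
    """Return pairs of (utterance, label) rows that look alike but disagree."""
    vectors = tfidf_vectors([utterance for utterance, _ in rows])
    conflicts = []
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if rows[i][1] != rows[j][1] and cosine(vectors[i], vectors[j]) >= threshold:
                conflicts.append((rows[i], rows[j]))
    return conflicts
```

The pairwise loop is quadratic, which is acceptable for a sketch; at the data volumes mentioned in this deck a real tool would need indexing or blocking.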

Data Governance – Test Data
Unit Testing Automation E2E Testing Automation
In-House Automated
Unit Test Tool for
Domain, Intent and
Slot
DEV
Server
Accepted ? Accepted ?
STG
Server
Accepted ?
PRD
Server
Development and
Management of Data
Analysis based on Data Governance Tool
Y Y Y
N N N
End User
VOC Issues

Key Learnings - Data
• Managing the distribution and variations of data is essential
• Quality of Data is critical
o Balance of Data to maintain the expected distribution of data across different classes
o Special handling for rejection Data
• A Deep Learning Engineer / Data Scientist must spend 30% of his or her time
looking at the data
• People are needed to manage this volume of data
• Tools / Automation need to be developed for pre-processing of data
• We cannot avoid hand-cleaning or hand-engineering of data
• Obvious need for Data Governance as well as Continuous Monitoring of product
quality.
• The NLP / ML driven project cycle (including data) is quite different from
the conventional SW project cycle

Samsung voice intelligence.v5.5

ASR: Challenge of Speech
Is different for every speaker
May be fast, slow, or varying in speed
May have high pitch, low pitch, or be whispered
Has widely-varying types of environmental noise
Changes depending on sequence of phonemes
Changes depending on speaking style
May not have distinct boundaries between units
Changes depending on the semantics of the utterance
Has an unlimited number of words

Bixby ASR - Fundamentals
Language
Model(s)
voice packet
Feature
Extraction Decoder
Acoustic
Model(s)
ASR
System
ASR
Hypothesis
Inverse
Text
Normalization

Bixby ASR - Fundamentals
• Acoustic Model
  • Links acoustics to word/phoneme sequence
  • Estimates the likelihood of an acoustic sequence given a
    word/phoneme (LSTM)
• Language Model
  • Prior on word sequences
  • Probability of a word given the preceding words (n-gram)
• Decoder
  • Finds the best word sequence, i.e. searches for the
    lowest-cost path in a graph
  • Uses the Viterbi algorithm (dynamic programming)
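
The decoder's lowest-cost path search can be illustrated with a textbook Viterbi implementation over a tiny hidden Markov model. This is a generic sketch, not Bixby's decoder: the states `s1`/`s2`, the observations, and all probabilities are made-up toy values, and costs are negative log probabilities so that the best path is the cheapest one.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best state path, total cost) with cost = -log probability."""
    first = observations[0]
    cost = {s: -math.log(start_p[s]) - math.log(emit_p[s][first]) for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_cost, pointers = {}, {}
        for s in states:
            # Cheapest predecessor for state s at this time step.
            prev = min(states, key=lambda p: cost[p] - math.log(trans_p[p][s]))
            new_cost[s] = cost[prev] - math.log(trans_p[prev][s]) - math.log(emit_p[s][obs])
            pointers[s] = prev
        backpointers.append(pointers)
        cost = new_cost
    last = min(states, key=cost.get)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, cost[last]


# Toy model (all values are illustrative assumptions).
STATES = ["s1", "s2"]
START = {"s1": 0.6, "s2": 0.4}
TRANS = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
EMIT = {
    "s1": {"a": 0.5, "b": 0.4, "c": 0.1},
    "s2": {"a": 0.1, "b": 0.3, "c": 0.6},
}
```

In an ASR setting the transition costs would come from the language model and the emission costs from the acoustic model; dynamic programming keeps the search linear in the length of the input instead of enumerating every sequence.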

Bixby ASR – Multi Accent
United States China
India
United Kingdom
Spain
South Korea
DEFAULT ACCENTED
Accent Determination
Based on:
- On-Boarding Utterances
- SIM Card Information
- Keyboard Language
- Contact Details
Australia
Canada

Challenge for Indian Market
• Hindi was targeted as the language of experimentation.
• Indian languages such as Hindi are often used in conjunction with English,
e.g. "camera खुला करो" ("open the camera")
• We have developed a bi-lingual (English + Hindi) model for the Hindi classifier
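
The code-mixed example above can be made concrete by tagging each token with its script. This sketch only shows why a single-language model is a poor fit for such input; it is not the bilingual model itself, and the tag names `en`, `hi`, and `other` are assumptions. It relies on the Devanagari Unicode block spanning U+0900 to U+097F.

```python
def script_of(token):
    # Any character in the Devanagari block marks the token as Hindi script.
    if any("\u0900" <= ch <= "\u097F" for ch in token):
        return "hi"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "en"
    return "other"


def tag_scripts(utterance):
    return [(token, script_of(token)) for token in utterance.split()]
```

For "camera खुला करो" the first token is tagged `en` and the remaining two `hi`, so one utterance carries vocabulary from both languages and the classifier has to handle them together.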
