This document discusses the development of Bixby, Samsung's intelligent virtual assistant. It describes Bixby as being fundamentally different from other assistants due to its ability to seamlessly switch between voice and touch modes, its context awareness, and its ability to understand incomplete commands. The document outlines some of the key challenges in developing Bixby, including managing its massive contextual input space and variable output capabilities across devices and versions. It discusses the use of deep learning and other techniques to address these challenges.
Samsung voice intelligence.v5.5
1. Taking the Road Less Travelled:
In pursuit of a Multi-modal
experience for Bixby
Samsung R&D Bangalore, India
Dr. Vikram Vij
vikram.v@samsung.com
2. Intelligent Assistants are fast emerging as the next breakthrough
user interface
1990s
Web
2000s
Apps
Today
Assistants
3. Evolution of Human Computer Interface
GUI
(~1980s)
Touch UI
(~2000)
Voice
(2011)
Bixby
(2017)
Changes of Interface Paradigm
Voice Assistant Market Research Report
Global Forecast 2023
Reference : https://www.marketresearchfuture.com/reports/voice-assistant-market-4003
4. Bixby Introduction
Bixby is an intelligent, personalized voice interface for your phone.
It's multi-modal: it lets you seamlessly switch between voice and touch modes.
o Launch Date : 19th July 2017 (US), 22nd Aug (Global)
o Available in more than 200 countries
o More than 75 Domains supported (Camera, Gallery, Messages, WhatsApp, YouTube, Uber, etc.)
o More than 27 million registered users
http://bixby.samsung.com/meet-bixby
https://www.youtube.com/watch?v=dbmVtseEjo4&index=1&list=PLrV44rSVouDcbvky1f77mUjWLCq8WI-Z1
https://www.youtube.com/watch?v=Gcd4NpK2fTI
6. Bixby Overview
Supporting every task of
the application
Understanding the current
context and state of app
Find an
umbrella photo
Manual editing
VOICE
TOUCH
VOICE
1
2
3
Understanding commands
with incomplete info
Send this photo
via message
To whom?
To Jane
Done
Incomplete Command
A true one click action
- Turn on
- Authenticate
- Unlock
- Wake the phone
- Execute the command
Supporting Samsung's
native apps
Request
incomplete.
Error
Show me the Wi-Fi data
usage
Press &
Hold
Post it on
Instagram
Bixby is fundamentally different from other voice agents or
assistants in the market because of its:
Completeness, Context Awareness, Cognitive Tolerance, Frictionless
7. Bixby - Cognitive Tolerance
ASR
Incomplete or inaccurate instructions are still carried out by using the context.
8. Bixby | Human Computer Interface Revolution
With English Support, Samsung's Bixby Impresses Vs. Siri And
Google Assistant
Bixby is perhaps in the most precarious spot, as it's going to be
competing directly against Google Assistant on some devices. Bixby's
capabilities sound quite impressive thanks to its integration with
other Samsung apps
Galaxy S8's voice sidekick can do things Siri can't
9. Bixby v1.0: Minimalistic View
ASR
NLU
voice packet
text input
command
ASR
ASR: Automatic Speech Recognition
NLU: Natural Language Understanding
11. Key Challenges
Design
o Text and Voice: Co-existence of Dual Modality
o Representation of Massive Input Space
o Management of Massive Context
o Handling of Variable Output Space
o Design of Deep Learning Architecture to Achieve this
Data
o Managing the distribution and variations of data
o Balance of Data to maintain the expected distribution of data across different classes
o Special handling for rejection Data
12. Bixby: The Multi-Modal Point of View
Home > Settings > Connections > Data Usage
Touch Interface Voice Interface
+
Show me the mobile data usage
13. Bixby: The Multi-Modal Point of View (contd)
Touch UI
Screen Flow
Voice UI
Find Hawaii photos in Gallery
Context Context Context Context
find James in Contacts application => contact information of James
find James in Gallery application => images tagged as James
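The two examples above show one utterance resolving to different results depending on the application in the foreground. A minimal sketch of that idea follows, assuming a plain lookup table keyed by (context, verb). The names `COMMANDS` and `resolve`, and the command identifiers, are hypothetical and are not taken from Bixby.

```python
from typing import Tuple

# Hypothetical table: the same verb maps to a different command in each app context.
COMMANDS = {
    ("Contacts", "find"): "show_contact_information",
    ("Gallery", "find"): "show_images_tagged_as",
}


def resolve(context: str, utterance: str) -> Tuple[str, str]:
    """Return (command, argument) for utterances shaped like 'find James'."""
    verb, _, argument = utterance.partition(" ")
    command = COMMANDS[(context, verb.lower())]
    return command, argument


print(resolve("Contacts", "find James"))
print(resolve("Gallery", "find James"))
```

A table like this stops scaling once there are thousands of contexts and commands, which is the problem the following slides address with a learned classifier.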
14. Leap Required for NLU toward Multi-Modality
Traditional
NLU
Multi-Modal
NLU
Context
Awareness
Massive Number of Contexts Varying Set of Commands
Thousands of states
Note8
S8
TabS
Various device models,
apps, locales
15. Input Space = (2,000 Contexts) x (Utterances for 6,000 commands)
Challenge of Massive Contextual Input Space
Find James + Picture View Context => James Picture
Find James + Contact View Context => James Contact
Static
Classifier
Static
Classifier
Static
Classifier
Static
Classifier
6000+ command classes
Context Space
2000+ contexts
16. Deep Learning was chosen instead of SVMs, Random Forests, etc.
Massive number of Classes
Approximately 60 Classes for Domains
Approximately 6K Classes for Intents
Closeness of Domains
The nature of the classes is similar
Examples: Reminder, Calendar and Clock
Huge Data
10M samples for Domain Classification
1.5M samples for Intent Classification (on average per Domain)
Motivation for Deep Learning
Domain
Classification
Intent
Classification
Slot Tagger
Utterance
Slots
Domain Label
Intent Label
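The flow on this slide (utterance, then domain label, then intent label, then slots) can be sketched as three stages called in sequence. This is only an illustration: the keyword rules below are toy stand-ins for the trained DNN models, and every function name, intent name and slot label is an assumption made for the example.

```python
from typing import Callable, Dict, List, Tuple


def classify_domain(utterance: str) -> str:
    """Toy stand-in for the domain classifier (about 60 classes in the slides)."""
    text = utterance.lower()
    if "alarm" in text:
        return "Clock"
    if "meeting" in text:
        return "Calendar"
    return "Reminder"


# One intent classifier per domain; each one is a trivial placeholder here.
INTENT_CLASSIFIERS: Dict[str, Callable[[str], str]] = {
    "Clock": lambda utterance: "set_alarm",
    "Calendar": lambda utterance: "create_event",
    "Reminder": lambda utterance: "create_reminder",
}


def tag_slots(utterance: str) -> List[Tuple[str, str]]:
    """Toy slot tagger: tokens containing ':' are tagged TIME, the rest O."""
    return [(tok, "TIME" if ":" in tok else "O") for tok in utterance.split()]


def understand(utterance: str) -> dict:
    """Run domain classification, then intent classification, then slot tagging."""
    domain = classify_domain(utterance)
    intent = INTENT_CLASSIFIERS[domain](utterance)
    slots = [(tok, tag) for tok, tag in tag_slots(utterance) if tag != "O"]
    return {"domain": domain, "intent": intent, "slots": slots}


print(understand("set an alarm for 7:30"))
```

Splitting the decision this way keeps each intent classifier focused on a single domain instead of on all of the roughly 6K intent classes at once.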
17. Approach for Massive Contextual Input Space
Context-conditioned DNN classifier + Sampling
Context-Aware
DNN Classifier
Sampling
6000+ commands
Training Set
Input (Context + Utterance) => Output (Command)
context_α + utterance_b => command_1
context_α + utterance_c => command_2
context_α + utterance_a => command_1
context_β + utterance_b => command_2
context_β + utterance_c => command_2
context_β + utterance_a => command_1
Hierarchical classifier
Session based architecture
Rejection Logic in Intent
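As a rough illustration of the training set on this slide, the sketch below builds the (context, utterance) => command examples, encodes each input as one vector, and samples a subset instead of enumerating the full context x utterance space. The one-hot encoding, the function names and the ASCII names `context_alpha` / `context_beta` are assumptions for the example. The slides do not say how the context is fed to the DNN or how the sampling is done.

```python
import random

CONTEXTS = ["context_alpha", "context_beta"]
UTTERANCES = ["utterance_a", "utterance_b", "utterance_c"]

# The same utterance can map to a different command in a different context.
LABELS = {
    ("context_alpha", "utterance_a"): "command_1",
    ("context_alpha", "utterance_b"): "command_1",
    ("context_alpha", "utterance_c"): "command_2",
    ("context_beta", "utterance_a"): "command_1",
    ("context_beta", "utterance_b"): "command_2",
    ("context_beta", "utterance_c"): "command_2",
}


def encode(context, utterance):
    """Concatenate a one-hot context vector with a one-hot utterance vector."""
    ctx = [1.0 if c == context else 0.0 for c in CONTEXTS]
    utt = [1.0 if u == utterance else 0.0 for u in UTTERANCES]
    return ctx + utt


def sample_training_set(k, seed=0):
    """Draw k (input vector, command) examples without replacement."""
    rng = random.Random(seed)
    pairs = rng.sample(sorted(LABELS), k)
    return [(encode(c, u), LABELS[(c, u)]) for c, u in pairs]


for features, command in sample_training_set(4):
    print(features, "=>", command)
```

Because the context is part of the input, a single classifier can cover every context rather than needing one static classifier per context.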
18. RNN word model had difficulty in:
Handling unknowns (word misspellings)
Learning word inflections (word boundary going beyond representation)
State based learning
So switched to CNN character model
Challenge of RNN vs CNN
e.g. "search for s8 plus" goes to the Calculator domain
e.g. on the Settings > Bluetooth screen: "turn off please"
Issue: the state is not learnt (Wi-Fi off is detected)
19. Determining the Optimal Filter Size
Smaller filter size used for sub-word level features
Larger filter size used for understanding language structures
Challenge of CNN Filter Size
Multiple filters with various sizes work in parallel
Final layer of CNN which gives best output
Reference : hackerearth.com
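To make the filter-size idea concrete, here is a small pure-Python sketch of a character-level convolution with several filter widths running in parallel, each followed by ReLU and max-over-time pooling. It assumes one-hot character input and random, untrained filters. The widths, the alphabet and all names are illustrative choices, not the configuration used in Bixby.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 "


def one_hot(text):
    """Encode each character as a one-hot row; unknown characters become all zeros."""
    rows = []
    for ch in text.lower():
        row = [0.0] * len(ALPHABET)
        idx = ALPHABET.find(ch)
        if idx >= 0:
            row[idx] = 1.0
        rows.append(row)
    return rows


def make_filter(width, rng):
    """Create a random (untrained) filter spanning `width` characters."""
    return [[rng.uniform(-1.0, 1.0) for _ in ALPHABET] for _ in range(width)]


def conv_max_pool(x, filt):
    """Slide the filter over the sequence, apply ReLU, keep the largest response."""
    width = len(filt)
    best = 0.0
    for start in range(len(x) - width + 1):
        score = sum(
            x[start + i][j] * filt[i][j]
            for i in range(width)
            for j in range(len(ALPHABET))
        )
        best = max(best, score)  # best starts at 0.0, so this also acts as ReLU
    return best


def features(text, widths=(2, 3, 5), filters_per_width=4, seed=0):
    """Concatenate the pooled outputs of all filters of all widths."""
    rng = random.Random(seed)
    x = one_hot(text)
    return [
        conv_max_pool(x, make_filter(w, rng))
        for w in widths
        for _ in range(filters_per_width)
    ]


print(len(features("search for s8 plus")))
```

Narrow filters respond to short character n-grams, so a misspelled word still shares most of its features with the correct spelling, while wider filters can span several characters of neighbouring words.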
20. Challenge of Variable Output Space
App Version  Device Models  Locale
India V 1.1
Turn on Bluetooth tethering
Turn on USB tethering
Turn on tethering
Note8
S8
TabS
Model A
Model B
21. Approach for Variable Output Space
Version Management Mechanism for NLU Engine
Note 8
Country
Installed app info
OS version
Version Metadata
Version mask vectors
V1
V2
V3
Device
Server
Version DB
NLU Core
Command
Classification
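One way to read the version mask vectors on this slide: the NLU core scores every command, and a 0/1 mask selected from the version metadata hides the commands that do not exist on the requesting device. The sketch below assumes exactly that. The mask contents, the command identifiers and the `classify` function are hypothetical.

```python
COMMANDS = [
    "turn_on_tethering",
    "turn_on_usb_tethering",
    "turn_on_bluetooth_tethering",
]

# Hypothetical masks: 1 means the command exists for that version profile.
VERSION_MASKS = {
    "V1": [1, 0, 0],
    "V2": [1, 1, 0],
    "V3": [1, 1, 1],
}


def classify(scores, version):
    """Return the best-scoring command among those the version mask allows."""
    mask = VERSION_MASKS[version]
    allowed = [(score, cmd) for score, cmd, m in zip(scores, COMMANDS, mask) if m]
    return max(allowed, key=lambda pair: pair[0])[1]


scores = [0.2, 0.3, 0.5]  # made-up classifier scores for one utterance
for version in ("V1", "V2", "V3"):
    print(version, classify(scores, version))
```

With this arrangement a single trained model can serve all versions, and supporting a new device or app version only means adding a mask to the version DB.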
22. Key Learnings - Design
Need to experiment with various DNN architectures & parameters; make
sure experiments have a rationale
The obvious choice of DNN may not work best for text: RNNs are typically used,
but CNNs proved to be better
Hierarchical design may work better (e.g. text classification)
Feature based matching for intent classes where 100% accuracy is needed
Rule-Based Matching of NER instead of ML/DL based NER
Rejection Based Intent Classification for Close Domains
Can abstract out complexity where possible (e.g. variable output space)
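The slides do not describe how rejection is implemented. One common approach, shown here purely as an assumption, is to reject an utterance when the classifier's top softmax probability falls below a threshold. The threshold value and the function names are illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_rejection(logits, labels, threshold=0.6):
    """Return the top label, or 'REJECT' when the model is not confident enough."""
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else "REJECT"


labels = ["Reminder", "Calendar", "Clock"]
print(classify_with_rejection([4.0, 0.5, 0.2], labels))  # one clear winner
print(classify_with_rejection([1.0, 0.9, 0.8], labels))  # three close scores
```

An alternative that matches the deck's emphasis on rejection data is to train an explicit reject class, so the model learns what out-of-domain input looks like instead of relying on a fixed threshold.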
23. Massive Data Flow
Synthetic
Generation of Data
Purchased (3rd Party )
Data
Crawled Data for
Out of Domain
Voice of Customer Data
Quick Grammar Data
DC
Bucketed and annotated
for Single Intent Class
DC and Intent Separated
by Class Levels
Bucketed by Single Intent
Class
Special Data
Market Issues & Bug Fixes
for Intent and Domain
Sampled 2K/Class
Hand-cleaned and Consumed Total
Hand-cleaned & Down Sampled
Sampled 2K/Class
Service API Layer
Intent Slot
Sampled 10K ~ 20K/Class
Sampled 10K ~ 20K/Class
Hand-cleaned & Down Sampled
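The "Sampled 2K/Class" and "Down Sampled" steps in this flow amount to capping how many examples each class contributes. A minimal sketch of such a cap follows, assuming uniform random sampling within each class. The function name and the seeded random generator are choices made for the example, not details from the slides.

```python
import random
from collections import defaultdict


def downsample_per_class(samples, cap, seed=0):
    """Keep at most `cap` randomly chosen (text, label) pairs for each label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for text, label in samples:
        by_class[label].append(text)

    balanced = []
    for label in sorted(by_class):
        items = by_class[label]
        keep = items if len(items) <= cap else rng.sample(items, cap)
        balanced.extend((text, label) for text in keep)
    return balanced


data = [("u%d" % i, "A") for i in range(10)] + [("v%d" % i, "B") for i in range(3)]
print(len(downsample_per_class(data, cap=5)))
```

Classes that are already below the cap are left untouched, so the cap only trims the over-represented classes.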
24. Data Governance - Training Data
Tools were used to detect & resolve data conflicts across
Domains & Intents
TF-IDF based tool
Cosine similarity based tool
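A conflict of the kind these tools look for is a pair of near-identical utterances that carry different labels. The sketch below is one possible implementation using TF-IDF vectors and cosine similarity in plain Python. The smoothed IDF formula, the 0.8 threshold and the function names are assumptions, since the in-house tools are not described in the slides.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (token => weight) for each document."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
            for t, c in tf.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_conflicts(samples, threshold=0.8):
    """Return pairs of similar utterances whose labels disagree."""
    vecs = tfidf_vectors([text for text, _ in samples])
    conflicts = []
    for i in range(len(samples)):
        for j in range(i + 1, len(samples)):
            different = samples[i][1] != samples[j][1]
            if different and cosine(vecs[i], vecs[j]) >= threshold:
                conflicts.append((samples[i], samples[j]))
    return conflicts


samples = [
    ("remind me at 7 am", "Reminder"),
    ("remind me at 7 am", "Clock"),
    ("show my photos", "Gallery"),
]
print(find_conflicts(samples))
```

The pairwise loop is quadratic in the number of samples, so at the data volumes quoted earlier a real tool would need blocking or an index. The sketch only shows the comparison itself.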
25. Data Governance - Test Data
Unit Testing Automation E2E Testing Automation
In-House Automated
Unit Test Tool for
Domain, Intent and
Slot
DEV
Server
Accepted ? Accepted ?
STG
Server
Accepted ?
PRD
Server
Development and
Management of Data
Analysis based on Data Governance Tool
Y Y Y
N N N
End User
VOC Issues
26. Key Learnings - Data
Managing the distribution and variations of data is essential
Quality of Data is critical
o Balance of Data to maintain the expected distribution of data across different classes
o Special handling for rejection Data
A Deep Learning Engineer / Data Scientist must spend 30% of his or her time
looking at the data
People are needed to manage this volume of data
Tools / Automation need to be developed for pre-processing of data
We cannot avoid hand-cleaning or hand-engineering of data
Obvious need for Data Governance as well as Continuous Monitoring of product
quality.
The NLP / ML driven project cycle (including data) is quite different from
the conventional SW project cycle
28. ASR: Challenge of Speech
Is different for every speaker
May be fast, slow, or varying in speed
May have high pitch, low pitch, or be whispered
Has widely-varying types of environmental noise
Changes depending on sequence of phonemes
Changes depending on speaking style
May not have distinct boundaries between units
Changes depending on the semantics of the utterance
Has an unlimited number of words
29. Bixby ASR - Fundamentals
Language
Model(s)
voice packet
Feature Extraction
Decoder
Acoustic
Model(s)
ASR
System
ASR
Hypothesis
Inverse
Text
Normalization
30. Acoustic Model
Links Acoustics to Word/phoneme sequence
Estimates the likelihood of an acoustic sequence given a
word/phoneme (LSTM)
Language Model
Prior on word sequences
Probability of a word given the preceding words (n-gram)
Decoder
Finds the best word sequence, i.e. searches for the
lowest-cost path in a graph
Uses the Viterbi algorithm (dynamic programming)
Bixby ASR - Fundamentals
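The decoder's lowest-cost path search can be illustrated with a textbook Viterbi pass over a tiny hidden Markov model, using negative log-probabilities as costs. The two states, the observation symbols and all probabilities below are invented for the example. A production decoder searches a far larger graph that combines acoustic model and language model scores.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the lowest-cost (most probable) state sequence for the observations."""
    # cost[s] is the best negative log-probability of any path ending in state s.
    cost = {
        s: -math.log(start_p[s] * emit_p[s][observations[0]]) for s in states
    }
    backpointers = []
    for obs in observations[1:]:
        new_cost, pointers = {}, {}
        for s in states:
            # Pick the cheapest predecessor state for s.
            prev = min(states, key=lambda p: cost[p] - math.log(trans_p[p][s]))
            new_cost[s] = (
                cost[prev] - math.log(trans_p[prev][s]) - math.log(emit_p[s][obs])
            )
            pointers[s] = prev
        cost = new_cost
        backpointers.append(pointers)

    # Trace the best path backwards from the cheapest final state.
    path = [min(states, key=cost.__getitem__)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return list(reversed(path))


states = ["speech", "silence"]
start_p = {"speech": 0.6, "silence": 0.4}
trans_p = {
    "speech": {"speech": 0.7, "silence": 0.3},
    "silence": {"speech": 0.4, "silence": 0.6},
}
emit_p = {
    "speech": {"loud": 0.9, "quiet": 0.1},
    "silence": {"loud": 0.2, "quiet": 0.8},
}
print(viterbi(["loud", "quiet", "quiet"], states, start_p, trans_p, emit_p))
# prints ['speech', 'silence', 'silence']
```

Dynamic programming keeps only the best path into each state at each step, so the work grows linearly with the length of the utterance rather than exponentially.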
31. Bixby ASR - Multi Accent
United States, China, India, United Kingdom, Spain, South Korea, Australia, Canada
DEFAULT / ACCENTED
Accent Determination Based on:
o On-Boarding Utterances
o SIM Card Information
o Keyboard Language
o Contact Details
32. Challenge for Indian Market
Hindi targeted as language of experimentation.
Indian languages, e.g. Hindi, are used in conjunction with English
e.g. camera 爐爛爐迦ぞ 爐爐萎
We have developed a bilingual (English + Hindi) model for the Hindi classifier