This document discusses scaling deep learning for artificial intelligence applications. It describes how deep learning is being used to solve challenging problems in areas like computer vision, speech recognition, and medical diagnostics. Training deep neural networks is a high-performance computing problem that requires large models, massive datasets, and efficient parallel training techniques. The author discusses his work using up to 128 GPUs across multiple nodes to train very large neural networks and obtain state-of-the-art results in domains like speech recognition.
What do we want AI to do?
• Drive us to work
• Serve drinks?
• Help us communicate
• Keep us organized
• Help us find things
• Guide us to content
Medical Diagnostics App
Baidu BDL
AskADoctor can assess 520 different diseases, representing ~90 percent of the most common medical problems.
Image Captioning
Baidu IDL
"A yellow bus driving down a road with green trees and green grass in the background."
"Living room with white couch and blue carpeting. Room in apartment gets some afternoon sun."
Natural User Interfaces
• Goal: Make interacting with computers as natural as interacting with humans
• AI problems:
  – Speech recognition
  – Emotion recognition
  – Semantic understanding
  – Dialog systems
  – Speech synthesis
AI applications are hard…
Machine learning can solve challenging problems, but it is a lot of work!
This eventually worked ~95% of the time.
Why are applications so hard?
[Figure: a photo labeled "Coffee Mug" shown as raw pixel intensities]
Pixel intensity is a very difficult representation…
Why are applications so hard?
[Figure: each image reduced to a point of pixel intensities, e.g. Pixel Intensity = [72 160], plotted on axes pixel 1 vs. pixel 2; coffee mugs (+) and non-mugs (−) are interleaved]
Why are applications so hard?
[Figure: the same scatter of coffee mugs (+) and non-mugs (−) in pixel space; a learning algorithm must answer "Is this a Coffee Mug?" from these points]
Machine learning in practice
[Diagram: Feature Extraction, built from prior knowledge and experience, feeds a Machine Learning classifier, which outputs "Mug"]
Machine learning in practice
• Enormous amounts of research time spent inventing new features.
[Diagram: Idea → Code → Test loop; think really hard…, hack it up in Matlab, run on a workstation]
Learning features
• Deep learning: learn multiple stages of features to achieve the end goal.
Pixels → Features → Features → Classifier → "Mug"?
Learning features
• "Neural networks" are one way to represent features
Pixels → Features → Features → Classifier → "Mug"?
Each stage computes y = g(Wx): the weights W map input x to output y through a nonlinearity g.
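As a concrete sketch, here is one such stage in numpy (the layer sizes, random weights, and the choice of ReLU for g are assumptions for illustration):

```python
import numpy as np

def layer(W, x, g=lambda z: np.maximum(z, 0)):
    """One stage of learned features: y = g(Wx), with g a nonlinearity (ReLU here)."""
    return g(W @ x)

x = np.random.rand(784)         # e.g. the pixels of a 28x28 image
W = np.random.randn(128, 784)   # weights for 128 learned features
y = layer(W, x)                 # the learned feature representation
```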
Learning features
• Deep learning: learn multiple stages of features to achieve the end goal
Pixels → Features (W1) → Features (W2) → Classifier (W3) → "Mug"?
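Stacking stages just composes the same operation, as in this sketch (the shapes are again assumptions; it reuses layer and x from the snippet above):

```python
W1 = np.random.randn(256, 784)   # pixels -> first-stage features
W2 = np.random.randn(128, 256)   # first-stage -> second-stage features
W3 = np.random.randn(1, 128)     # second-stage features -> classifier score

score = W3 @ layer(W2, layer(W1, x))   # is this a "Mug"?
```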
Why Deep Learning?
1. Scale Matters
   – Bigger models usually win
2. Data Matters
   – More data means less cleverness necessary
3. Productivity Matters
   – Teams with better tools can try out more ideas
[Figure: accuracy vs. data & compute; deep learning keeps improving where many previous methods plateau]
Scaling up
• Make progress on AI by focusing on systems
  – Make models bigger
  – Tackle more data
  – Reduce research cycle time
• Accelerate large-scale experiments
Exascale
• Strong scaling important but difficult
  – Weak scaling over time as datasets increase
• We run our experiments on 8-128 GPUs
• Exascale likely important for running many "small" experiments
Training Deep Neural Networks
• Computation dominated by dot products
• Multiple inputs, multiple outputs, and batching mean GEMM (see the sketch below)
  – Compute bound
• Convolutional layers even more compute bound
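To see why batching turns the core computation into GEMM, compare one input (a matrix-vector product, GEMV) with a minibatch (a matrix-matrix product, GEMM); the sizes here are assumptions:

```python
import numpy as np

W = np.random.randn(1024, 1024)   # layer weights
x = np.random.randn(1024)         # a single input
X = np.random.randn(1024, 64)     # a minibatch of 64 inputs, one per column

y = W @ x   # GEMV: matrix-vector product, memory bound
Y = W @ X   # GEMM: matrix-matrix product, compute bound and far better at using hardware
```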
Computational Characteristics
• High arithmetic intensity
  – Arithmetic operations / byte of data
  – O(exaflops) / O(terabytes) ≈ 10^6
• In contrast, many other ML training jobs are O(petaflops) / O(petabytes) ≈ 10^0
• Medium-size datasets
  – Generally fit on 1 node
  – HDFS, fault tolerance, disk I/O not bottlenecks
Training 1 model: ~10 exaflops
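A back-of-envelope check of that ratio, using the slide's numbers (the exact dataset size is an assumed order of magnitude):

```python
total_flops   = 10e18   # ~10 exaflops to train one model
dataset_bytes = 10e12   # O(10 terabytes) of training data
print(total_flops / dataset_bytes)   # ~1e6 arithmetic operations per byte
```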
Deep Neural Network training is HPC
[Diagram: Idea → Code → Test loop]
• Turnaround time is key
• Use the most efficient hardware
  – Parallel, heterogeneous computing
  – Fast interconnect (PCIe, InfiniBand)
• Push strong scalability
  – Models and data have to be of commensurate size
• This is all standard HPC!
Training: Stochastic Gradient Descent
• Simple algorithm (sketched below)
  – Add momentum to power through local minima
  – Compute gradient by backpropagation
• Operates on minibatches
  – This makes it a GEMM problem instead of GEMV
• Choose minibatches stochastically
  – Important to avoid memorizing training order
• Difficult to parallelize
  – Prefers lots of small steps
  – Increasing minibatch size not always helpful
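A minimal sketch of the update rule (the learning rate, momentum coefficient, and toy quadratic loss are assumptions; real training computes the gradient by backpropagation over a stochastically chosen minibatch):

```python
import numpy as np

def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """One SGD step: the momentum buffer v accumulates a running descent direction."""
    v = mu * v - lr * grad
    return w + v, v

# Toy example: minimize ||w||^2, whose gradient is 2w.
w = np.random.randn(10)
v = np.zeros_like(w)
for _ in range(100):
    w, v = sgd_momentum_step(w, v, grad=2 * w)
```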
Limitations of batching
[Figure: error vs. iterations for batch size n and batch size 2n]
Spending 2x the work picking a direction doesn't reduce the iteration count by 2x.
Node Architecture
• All pairs of GPUs communicate simultaneously over PCIe Gen 3 x16
• Groups of 4 GPUs form a peer-to-peer domain
• Avoid moving data to CPUs or across QPI
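One way to see which GPU pairs in a node can transfer peer-to-peer rather than staging through host memory; a sketch assuming PyTorch on a multi-GPU machine (the check reports capability only, not what a given run actually does):

```python
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            p2p = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: {'peer-to-peer' if p2p else 'staged through host'}")
```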
Speech Recognition: Traditional ASR
• Getting higher performance is hard
• Improve each stage by engineering
[Figure: accuracy vs. data + model size; traditional ASR improves only through expert engineering (Adam Coates)]
Speech recognition: Traditional ASR
• Huge investment in features for speech!
  – Decades of work to get very small improvements
[Figure: hand-engineered speech features: spectrogram, MFCC, flux]
Speech Recognition 2: Deep Learning!
• Since 2011, deep learning for features
[Diagram: audio → Acoustic Model → HMM → Language Model → Transcription: "The quick brown fox jumps over the lazy dog."]
Speech Recognition 2: Deep Learning!
• With more data, DL acoustic models perform better than traditional models
[Figure: accuracy vs. data + model size; DL V1 for speech overtakes traditional ASR]
Speech Recognition 3: "Deep Speech"
• End-to-end learning
[Diagram: audio → one deep network → Transcription: "The quick brown fox jumps over the lazy dog."]
Speech Recognition 3: "Deep Speech"
• We believe end-to-end DL works better when we have big models and lots of data
[Figure: accuracy vs. data + model size; Deep Speech surpasses both traditional ASR and DL V1 for speech]
End-to-end speech with DL
• Deep neural network predicts characters directly from audio
[Diagram: per-frame character outputs, e.g. "T H _ E … D O G"]
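Per-frame outputs like these are collapsed into a transcript; here is a sketch of the standard greedy collapse (treating "_" as the blank symbol is an assumption based on the diagram; real decoding typically also applies a language model):

```python
def collapse(frames, blank="_"):
    """Greedy CTC-style decode: merge repeated characters, then drop blanks."""
    out, prev = [], None
    for c in frames:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

print(collapse(["T", "H", "_", "E", "E"]))   # -> "THE"
```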
Recurrent Network
• RNNs model temporal dependence (see the sketch below)
• Various flavors used in many applications
  – LSTM, GRU, bidirectional, …
  – Especially sequential data (time series, text, etc.)
• Sequential dependence complicates parallelism
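A minimal vanilla-RNN sketch that makes the sequential dependence concrete (the sizes, tanh nonlinearity, and random weights are assumptions; LSTMs and GRUs refine the same recurrence):

```python
import numpy as np

W = np.random.randn(128, 40) * 0.1    # input-to-hidden weights
U = np.random.randn(128, 128) * 0.1   # hidden-to-hidden (recurrent) weights
h = np.zeros(128)

xs = np.random.randn(100, 40)         # 100 time steps of 40-dim features
for x_t in xs:
    # Each h depends on the previous h, so the time loop cannot simply be parallelized.
    h = np.tanh(W @ x_t + U @ h)
```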
warp-ctc
• Recently open sourced our CTC implementation
• Efficient, parallel CPU and GPU backends
• 100-400X faster than other implementations
• Apache license, C interface
https://github.com/baidu-research/warp-ctc
Training sets
• Train on ~1.5 years of data (and growing)
• English and Mandarin
• End-to-end deep learning is key to assembling large datasets
• Datasets drive accuracy
All-reduce
• We implemented our own all-reduce out of send and receive
• Several algorithm choices based on size (a ring variant is sketched below)
• Careful attention to affinity and topology
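One classic way to build all-reduce from sends and receives is a ring; this sketch simulates it in-process, with numpy arrays standing in for per-GPU buffers (choosing the ring variant here is an assumption; the slide only says the algorithm is picked by size):

```python
import numpy as np

def ring_allreduce(bufs):
    """Sum-all-reduce across n simulated devices using only neighbor transfers."""
    n = len(bufs)
    chunks = [list(np.array_split(b.astype(float), n)) for b in bufs]
    # Reduce-scatter: after n-1 steps, rank r holds the full sum of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [chunks[r][(r - step) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[(r + 1) % n][(r - step) % n] += sends[r]
    # All-gather: circulate each completed chunk once around the ring.
    for step in range(n - 1):
        sends = [chunks[r][(r + 1 - step) % n].copy() for r in range(n)]
        for r in range(n):
            chunks[(r + 1) % n][(r + 1 - step) % n] = sends[r]
    return [np.concatenate(c) for c in chunks]

bufs = [np.arange(6.0) * (r + 1) for r in range(3)]   # per-device gradients
print(ring_allreduce(bufs))   # every device ends with the elementwise sum
```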
Scalability
• Batch size is hard to increase
  – Algorithm, memory limits
• Performance at small batch sizes (32, 64) leads to scalability limits
Performance for RNN training
• 55% of GPU FMA peak using a single GPU
• ~48% of peak using 8 GPUs in one node
• Weak scaling very efficient, albeit algorithmically challenged
[Figure: aggregate TFLOP/s (1 to 512) vs. number of GPUs (1 to 128), spanning one node to multi-node; the "typical training run" region is marked]
Precision
• FP16 mostly works (see the sketch below)
  – Use FP32 for softmax and weight updates
• More sensitive to labeling error
[Figure: weight distribution histogram; count (log scale, 1 to 10^8) vs. magnitude exponent (2^-31 to 2^0)]
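A sketch of the mixed-precision recipe this implies, with numpy as a stand-in (the toy loss, shapes, and learning rate are assumptions): run the heavy math in FP16 but keep FP32 master weights, since tiny update increments would be rounded away in FP16.

```python
import numpy as np

w_master = (0.01 * np.random.randn(256, 256)).astype(np.float32)  # FP32 master weights
lr = 1e-3

for step in range(10):
    w16 = w_master.astype(np.float16)                 # FP16 copy for the heavy math
    x16 = np.random.randn(256, 64).astype(np.float16)
    y16 = w16 @ x16                                   # forward pass in FP16 (GEMM)
    grad16 = y16 @ x16.T / 64                         # toy gradient of 0.5*||Wx||^2 per example
    # The update itself stays in FP32, mirroring the slide's advice to keep
    # weight updates (and softmax) out of FP16.
    w_master -= lr * grad16.astype(np.float32)
```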
Determinism
• Determinism very important
• With so much randomness, it's hard to tell if you have a bug
• Networks train despite bugs, although accuracy is impaired
• Reproducibility is important
  – For the usual scientific reasons
  – Progress not possible without reproducibility
• We use synchronous SGD
Conclusion
• Deep Learning is solving many hard problems
• Training deep neural networks is an HPC problem
• Scaling brings AI progress!