Optimization
李宏毅
Hung-yi Lee
Last time …
(Figure: shallow vs. deep networks of small, medium, and large size.)
A target function f* to fit: does the function space eventually cover f*?
What is the difference?
Optimization: Is it possible to find f* in the function space?
Optimization
Network: f_θ
Training data: (x¹, ŷ¹), (x², ŷ²), …, (x^N, ŷ^N)
L(θ) = Σ_{n=1}^{N} ℓ( f_θ(xⁿ), ŷⁿ )
θ* = arg min_θ L(θ)
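To make these definitions concrete, here is a minimal sketch (assuming a toy linear model f_θ(x) = θ·x and a squared loss ℓ; both are illustrative choices, not from the slides) that evaluates L(θ) on a tiny training set and minimizes it with plain gradient descent:

```python
import numpy as np

# Toy training data (x^n, y_hat^n); the linear model f_theta(x) = theta * x
# and the squared loss are assumptions for illustration only.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([2.1, 3.9, 6.2, 8.1])

def L(theta):
    """L(theta) = sum_n l(f_theta(x^n), y_hat^n) with squared loss."""
    return np.sum((theta * xs - ys) ** 2)

def grad_L(theta):
    """dL/dtheta for the squared loss."""
    return np.sum(2 * (theta * xs - ys) * xs)

theta, eta = 0.0, 0.01          # initial parameter and learning rate
for _ in range(100):            # plain gradient descent
    theta -= eta * grad_L(theta)

print(theta, L(theta))          # theta approaches arg min_theta L(theta) (about 2.0)
```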
Optimization ≠ Learning
In Deep Learning, L(θ) is not convex.
Non-convex optimization is NP-hard.
Why can we solve the problem by gradient descent?
Loss of Deep Learning is not convex
There are at least exponentially many global minima for a neural net:
permuting the neurons in one layer does not change the loss.
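This permutation symmetry is easy to check numerically. A minimal sketch, assuming a one-hidden-layer tanh network chosen purely for illustration: permuting the hidden neurons, together with the matching rows of W1 and b1 and columns of W2, leaves the output (and therefore the loss) unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # hidden layer: 4 neurons
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)   # output layer
x = rng.normal(size=3)

def forward(W1, b1, W2, b2, x):
    h = np.tanh(W1 @ x + b1)        # hidden activations
    return W2 @ h + b2              # network output

perm = rng.permutation(4)           # shuffle the hidden neurons
y_original = forward(W1, b1, W2, b2, x)
y_permuted = forward(W1[perm], b1[perm], W2[:, perm], b2, x)

# Same function, different parameter vector: exponentially many equivalent minima.
assert np.allclose(y_original, y_permuted)
```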
Non-convex ≠ Difficult
(Figure: non-convex loss curves L(θ).)
Gradient descent is not guaranteed to find the optimal solution.
Outline
Review: Hessian
Deep Linear Model
Deep Non-linear Model
Conjecture about Deep Learning
Empirical Observation about Error Surface
Hessian Matrix:
When Gradient is Zero
Some examples in this part are from:
https://www.math.upenn.edu/~kazdan/312F12/Notes/
max-min-notesJan09/max-min.pdf
Training stops …
• People believe training gets stuck because the parameters are near a critical point (critical point: gradient is zero).
Local minima? How about saddle points?
http://www.deeplearningbook.org/contents/optimization.html
When Gradient is Zero
L(θ) = L(θ₀) + (θ − θ₀)ᵀ g + ½ (θ − θ₀)ᵀ H (θ − θ₀) + …
Gradient g = ∇L(θ₀) is a vector:  g_i = ∂L(θ₀)/∂θ_i
Hessian H is a matrix:  H_ij = ∂²L(θ₀)/∂θ_i∂θ_j = ∂²L(θ₀)/∂θ_j∂θ_i = H_ji  (symmetric)
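As a sanity check of these definitions, the sketch below estimates g and H by finite differences for an assumed two-parameter example loss (the function and the step size eps are illustrative choices) and confirms the symmetry H_ij = H_ji:

```python
import numpy as np

def L(theta):
    # An arbitrary smooth example loss; any differentiable function works here.
    t1, t2 = theta
    return (1 - t1 * t2) ** 2 + 0.1 * t1 ** 2

def grad_and_hessian(L, theta0, eps=1e-4):
    """Finite-difference estimates of g_i = dL/dtheta_i and H_ij = d2L/dtheta_i dtheta_j at theta0."""
    n = len(theta0)
    g = np.zeros(n)
    H = np.zeros((n, n))
    for i in range(n):
        e_i = np.eye(n)[i]
        g[i] = (L(theta0 + eps * e_i) - L(theta0 - eps * e_i)) / (2 * eps)
        for j in range(n):
            e_j = np.eye(n)[j]
            H[i, j] = (L(theta0 + eps * e_i + eps * e_j) - L(theta0 + eps * e_i - eps * e_j)
                       - L(theta0 - eps * e_i + eps * e_j) + L(theta0 - eps * e_i - eps * e_j)) / (4 * eps ** 2)
    return g, H

g, H = grad_and_hessian(L, np.array([0.5, 1.0]))
print(g)
print(H)
print(np.allclose(H, H.T, atol=1e-4))   # the Hessian is symmetric: H_ij = H_ji
```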
Hessian
L(θ) = L(θ₀) + (θ − θ₀)ᵀ g + ½ (θ − θ₀)ᵀ H (θ − θ₀) + …
H determines the curvature.
Source of image: http://www.deeplearningbook.org/contents/numerical.html
Hessian
L(θ) = L(θ₀) + (θ − θ₀)ᵀ g + ½ (θ − θ₀)ᵀ H (θ − θ₀) + …
Newton's method
∇L(θ) ≈ ∂/∂θ [ (θ − θ₀)ᵀ g ] + ∂/∂θ [ ½ (θ − θ₀)ᵀ H (θ − θ₀) ]
∂/∂θ [ (θ − θ₀)ᵀ g ] = g
∂/∂θ [ ½ (θ − θ₀)ᵀ H (θ − θ₀) ] = H (θ − θ₀)
Find the θ such that ∇L(θ) = 0.
Hessian
L(θ) = L(θ₀) + (θ − θ₀)ᵀ g + ½ (θ − θ₀)ᵀ H (θ − θ₀) + …
Newton's method
∇L(θ) ≈ g + H (θ − θ₀) = 0
H (θ − θ₀) = −g   ⟹   θ − θ₀ = −H⁻¹ g
θ = θ₀ − H⁻¹ g   v.s.   θ = θ₀ − η g  (gradient descent)
H⁻¹ changes the update direction and determines the step size.
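Here is a small sketch of the two updates on an assumed quadratic loss L(θ) = ½ θᵀAθ − bᵀθ, for which g = Aθ − b and H = A: a single Newton step lands exactly on the critical point, while a gradient-descent step with learning rate η only moves part of the way.

```python
import numpy as np

# Assumed quadratic loss L(theta) = 1/2 theta^T A theta - b^T theta,
# so g = A theta - b, H = A, and the critical point is A^{-1} b.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, 1.0])
theta0 = np.array([5.0, -4.0])

g = A @ theta0 - b           # gradient at theta0
H = A                        # Hessian (constant for a quadratic)

theta_newton = theta0 - np.linalg.solve(H, g)   # theta0 - H^{-1} g
theta_gd     = theta0 - 0.1 * g                 # theta0 - eta * g, with eta = 0.1

print(theta_newton, np.linalg.solve(A, b))      # identical: one Newton step reaches the critical point
print(theta_gd)                                  # gradient descent needs many steps
```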
Hessian
L(θ) = L(θ₀) + (θ − θ₀)ᵀ g + ½ (θ − θ₀)ᵀ H (θ − θ₀) + …
Newton's method: what is the problem?
If L(θ) is a quadratic function, we obtain the critical point in one step.
(But the critical point Newton's method jumps to can be a saddle point or a maximum, and computing H⁻¹ is expensive when there are many parameters.)
Source of image: https://math.stackexchange.com/questions/609680/newtons-method-intuition
Not suitable for Deep Learning
Hessian
Source of image: http://www.offconvex.org/2016/03/22/saddlepoints/
L(θ) = L(θ₀) + (θ − θ₀)ᵀ g + ½ (θ − θ₀)ᵀ H (θ − θ₀) + …
At a critical point (g = 0):
H tells us the properties of the critical point.
Review: Linear Algebra
http://speech.ee.ntu.edu.tw/~tlkagk/courses/LA_2016/Lecture/eigen.pdf
• If Av = λv (v is a vector, λ is a scalar)
  • v is an eigenvector of A (excluding the zero vector)
  • λ is the eigenvalue of A that corresponds to v
A must be square.
Review: Positive/Negative Definite
• A is an n×n symmetric matrix.
• For every non-zero vector x (x ≠ 0):
  positive definite:       xᵀAx > 0   ⟺  all eigenvalues are positive
  positive semi-definite:  xᵀAx ≥ 0   ⟺  all eigenvalues are non-negative
  negative definite:       xᵀAx < 0   ⟺  all eigenvalues are negative
  negative semi-definite:  xᵀAx ≤ 0   ⟺  all eigenvalues are non-positive
Example: the identity matrix [1 0; 0 1] is positive definite.
Hessian
L(θ) ≈ L(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)
At a critical point:
• H is positive definite (xᵀHx > 0, all eigenvalues positive):
  around θ₀, L(θ) > L(θ₀)  ⟹  local minimum
• H is negative definite (xᵀHx < 0, all eigenvalues negative):
  around θ₀, L(θ) < L(θ₀)  ⟹  local maximum
• xᵀHx ≥ 0?  xᵀHx ≤ 0?
• Sometimes xᵀHx > 0, sometimes xᵀHx < 0  ⟹  saddle point
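This classification can be automated by inspecting the eigenvalues of H at the critical point; a minimal sketch (the example matrices match the worked examples that follow):

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the eigenvalues of its (symmetric) Hessian."""
    eigvals = np.linalg.eigvalsh(H)           # eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "local minimum"                # positive definite
    if np.all(eigvals < -tol):
        return "local maximum"                # negative definite
    if np.any(eigvals > tol) and np.any(eigvals < -tol):
        return "saddle point"                 # indefinite
    return "degenerate (inconclusive)"        # some eigenvalues are (numerically) zero

print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 6.0]])))    # local minimum
print(classify_critical_point(np.array([[-2.0, 0.0], [0.0, 6.0]])))   # saddle point
print(classify_critical_point(np.array([[2.0, 0.0], [0.0, 0.0]])))    # degenerate
```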
Hessian
L(θ) ≈ L(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)   (ignore the 1/2 for simplicity)
At a critical point:
If v is an eigenvector of H:  vᵀHv = vᵀ(λv) = λ‖v‖²  (= λ for a unit vector)
(Figure: moving from θ₀ in the direction v, i.e. θ₀ + v.)
Because H is an n×n symmetric matrix, its eigenvectors v₁, v₂, …, vₙ can form an orthonormal basis.
For u = α₁v₁ + α₂v₂ (v₁ and v₂ are orthogonal):
uᵀHu = (α₁v₁ + α₂v₂)ᵀ H (α₁v₁ + α₂v₂) = α₁²λ₁ + α₂²λ₂
(Figure: moving from θ₀ along v₁ and v₂, i.e. θ₀ + v₁ and θ₀ + v₂.)
Hessian
L(θ) ≈ L(θ₀) + ½ (θ − θ₀)ᵀ H (θ − θ₀)
At a critical point:
If v is an eigenvector of H:  vᵀHv = vᵀ(λv) = λ‖v‖²  (= λ for a unit vector)
Because H is an n×n symmetric matrix, its eigenvectors v₁, v₂, …, vₙ can form an orthonormal basis.
For u = α₁v₁ + α₂v₂ + … + αₙvₙ:
uᵀHu = α₁²λ₁ + α₂²λ₂ + … + αₙ²λₙ
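A numerical sketch of this decomposition, using an arbitrary assumed symmetric H: expand u in the orthonormal eigenbasis as u = α₁v₁ + … + αₙvₙ and check that uᵀHu = Σᵢ αᵢ²λᵢ, so the sign of uᵀHu is governed by the eigenvalues of the directions in which u has a component.

```python
import numpy as np

H = np.array([[1.0, 2.0, 0.0],
              [2.0, -1.0, 1.0],
              [0.0, 1.0, 3.0]])        # an assumed symmetric Hessian

lams, V = np.linalg.eigh(H)            # columns of V: orthonormal eigenvectors v_1..v_n
u = np.array([0.3, -1.2, 0.7])

alphas = V.T @ u                       # coordinates of u in the eigenbasis: u = sum_i alpha_i v_i
quad_direct = u @ H @ u                # u^T H u computed directly
quad_eigen  = np.sum(alphas ** 2 * lams)   # sum_i alpha_i^2 * lambda_i

print(quad_direct, quad_eigen)         # the two agree
assert np.allclose(quad_direct, quad_eigen)
```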
Examples
f(x, y) = x² + 3y²
∂f/∂x = 2x    ∂f/∂y = 6y    ⟹ critical point at x = 0, y = 0
∂²f/∂x² = 2    ∂²f/∂x∂y = 0
∂²f/∂y∂x = 0   ∂²f/∂y² = 6
H = [2 0; 0 6]
Positive definite ⟹ local minimum
Examples
f(x, y) = −x² + 3y²
∂f/∂x = −2x    ∂f/∂y = 6y    ⟹ critical point at x = 0, y = 0
∂²f/∂x² = −2    ∂²f/∂x∂y = 0
∂²f/∂y∂x = 0    ∂²f/∂y² = 6
H = [−2 0; 0 6]
Saddle point
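The two worked examples can also be reproduced symbolically; a sketch using sympy (the choice of library is an assumption, any computer algebra system works):

```python
from sympy import symbols, hessian, diff

x, y = symbols('x y')

for f in (x**2 + 3*y**2,      # local minimum at the origin
          -x**2 + 3*y**2):    # saddle point at the origin
    grad = [diff(f, v) for v in (x, y)]
    H = hessian(f, (x, y)).subs({x: 0, y: 0})
    print(f, grad, H.tolist(), H.eigenvals())
    #  x**2 + 3*y**2: gradient [2x, 6y],  H = [[2, 0], [0, 6]],  eigenvalues 2 and 6  (positive definite)
    # -x**2 + 3*y**2: gradient [-2x, 6y], H = [[-2, 0], [0, 6]], eigenvalues -2 and 6 (saddle)
```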
Degenerate
• A degenerate Hessian has at least one zero eigenvalue.
f(x, y) = x² + y⁴
∂f/∂x = 2x    ∂f/∂y = 4y³
∂²f/∂x² = 2    ∂²f/∂x∂y = 0
∂²f/∂y∂x = 0   ∂²f/∂y² = 12y²
At x = y = 0:  H = [2 0; 0 0]
Degenerate
• A degenerate Hessian has at least one zero eigenvalue.
f(x, y) = x² + y⁴:  at x = y = 0,  g = [0; 0],  H = [2 0; 0 0]
f(x, y) = x² − y⁴:  at x = y = 0,  g = [0; 0],  H = [2 0; 0 0]
No difference: the gradient and Hessian cannot tell the two cases apart
(the first point is a local minimum, the second is a saddle).
Degenerate
f(x, y) = −x⁴ − y⁴
∂f/∂x = −4x³    ∂f/∂y = −4y³
∂²f/∂x² = −12x²    ∂²f/∂x∂y = 0
∂²f/∂y∂x = 0    ∂²f/∂y² = −12y²
At x = y = 0:  g = [0; 0],  H = [0 0; 0 0]
f(x, y) = 0 also gives g = [0; 0] and H = [0 0; 0 0] at x = y = 0.
Degenerate
http://homepages.math.uic.edu/~julius/monkeysaddle.html
f(x, y) = x³ − 3xy²
∂f/∂x = 3x² − 3y²    ∂f/∂y = −6xy
∂²f/∂x² = 6x    ∂²f/∂x∂y = −6y
∂²f/∂y∂x = −6y   ∂²f/∂y² = −6x
Monkey Saddle (c.f.)
Training stuck ≠ Zero Gradient
• People believe training gets stuck because the parameters are around a critical point. !!!
http://www.deeplearningbook.org/contents/optimization.html
Training stuck ≠ Zero Gradient
http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/
Approach a saddle point, and then escape.
Deep Linear Network
https://arxiv.org/abs/1412.6544
(Figure: a linear network with a single path x → w₁ → w₂ → y, with x = 1 and ŷ = 1.)
E = (ŷ − w₁w₂x)² = (1 − w₁w₂)²
∂E/∂w₁ = 2(1 − w₁w₂)(−w₂)
∂E/∂w₂ = 2(1 − w₁w₂)(−w₁)
∂²E/∂w₁² = 2w₂²
∂²E/∂w₂² = 2w₁²
∂²E/∂w₁∂w₂ = −2 + 4w₁w₂
∂²E/∂w₂∂w₁ = −2 + 4w₁w₂
The probability of getting stuck at the saddle point is almost zero; it is easy to escape.
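A quick numerical check of this example: plugging the second-derivative formulas above into the critical point w₁ = w₂ = 0 gives eigenvalues +2 and −2, a strict saddle with a clear descent direction, which is why it is easy to escape.

```python
import numpy as np

def hessian_E(w1, w2):
    """Hessian of E(w1, w2) = (1 - w1*w2)^2, using the entries derived on the slide."""
    return np.array([[2 * w2**2,        -2 + 4 * w1 * w2],
                     [-2 + 4 * w1 * w2,  2 * w1**2      ]])

H0 = hessian_E(0.0, 0.0)                  # the critical point at the origin
print(H0)                                 # [[ 0, -2], [-2, 0]]
print(np.linalg.eigvalsh(H0))             # [-2, 2]: one negative direction, a strict saddle,
                                          # so gradient descent almost never gets stuck there
```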
2-hidden layers
(Figure: x → w₁ → w₂ → w₃ → y, with x = 1 and ŷ = 1.)
E = (1 − w₁w₂w₃)²
∂E/∂w₁ = 2(1 − w₁w₂w₃)(−w₂w₃)
∂E/∂w₂ = 2(1 − w₁w₂w₃)(−w₁w₃)
∂E/∂w₃ = 2(1 − w₁w₂w₃)(−w₁w₂)
∂²E/∂w₁² = 2(w₂w₃)²
∂²E/∂w₂∂w₁ = −2w₃ + 4w₁w₂w₃²
At w₁ = w₂ = w₃ = 0:  H = [0 0 0; 0 0 0; 0 0 0]  (so flat)
At w₁ = w₂ = 0, w₃ = k:  H = [0 −2k 0; −2k 0 0; 0 0 0]  (saddle point)
w₁w₂w₃ = 1: global minima
All minima are global, but some critical points are “bad”.
10-hidden layers
(Figure: error surface of a 10-hidden-layer linear model over w₁ and w₂, with the origin marked.)
Demo
Deep Linear Network
(Figure: x → W₁ → W₂ → … → W_K → y.)
y = W_K W_{K−1} ⋯ W₂ W₁ x
L = Σ_{n=1}^{N} ‖ŷⁿ − yⁿ‖²
If hidden layer size ≥ input dim and output dim:
every local minimum is a global minimum.
More than two hidden layers can produce saddle points without negative eigenvalues.
Reference
• Kenji Kawaguchi, “Deep Learning without Poor Local Minima”, NIPS, 2016
• Haihao Lu, Kenji Kawaguchi, “Depth Creates No Bad Local Minima”, arXiv, 2017
• Thomas Laurent, James von Brecht, “Deep linear neural networks with arbitrary loss: All local minima are global”, arXiv, 2017
• Maher Nouiehed, Meisam Razaviyayn, “Learning Deep Models: Critical Points and Local Openness”, arXiv, 2018
Non-linear Deep Network
Does it have local minima?
Proving that something does not exist is hard; proving that it exists is relatively easy.
(Thanks to classmate 曾子家 for spotting a typo on the slides.)
Even Simple Tasks can be Difficult
ReLU has local minima
(Figure: two small ReLU networks with fixed weights, fit to the values shown on the slide: (-1.3) (1,-3) (3,0) (4,1) (5,2).)
This ReLU network has local minima.
“Blind Spot” of ReLU
(Figure: a ReLU network in which every hidden unit outputs 0 for the inputs x and y, so all activations and the output are 0.)
Gradient is zero.
It is pretty easy to make this happen, as sketched below ……
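Here is a minimal sketch of the blind spot, using an assumed one-hidden-layer ReLU network with squared loss and a deliberately bad initialization (large negative biases): every pre-activation is negative on all training inputs, every hidden unit outputs 0, and the gradients for both weight layers are exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 3))                   # 8 training inputs
t = rng.normal(size=8)                        # targets

# "Blind spot" initialization: very negative biases push every pre-activation below zero.
W1 = 0.1 * rng.normal(size=(4, 3))
b1 = -10.0 * np.ones(4)
w2 = rng.normal(size=4)

grad_W1 = np.zeros_like(W1)
grad_w2 = np.zeros_like(w2)
for x, target in zip(X, t):
    z = W1 @ x + b1                           # pre-activations (all negative here)
    h = np.maximum(z, 0.0)                    # ReLU outputs: all exactly 0
    y = w2 @ h                                # network output
    dy = 2 * (y - target)                     # d loss / d y for squared loss
    grad_w2 += dy * h                         # zero, because h == 0
    grad_W1 += np.outer(dy * w2 * (z > 0), x) # zero, because ReLU is flat where z < 0

print(np.abs(grad_W1).max(), np.abs(grad_w2).max())   # both 0.0: the gradient vanishes
```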
“Blind Spot” of ReLU
• MNIST, Adam, 1M updates
Consider your initialization.
Considering Data
(Figure: a “label generator” network with k neurons produces the training labels for inputs x drawn from N(0,1); a “network to be trained” with n neurons is fit to those labels.)
If n ≥ k (and the further condition shown on the slide), we obtain the global minima.
The numbers k and n matter.
Considering Data
No local minima for n ≥ k + 2
(Further figures: empirical results.)
Reference
• Grzegorz Swirszcz, Wojciech Marian Czarnecki, Razvan Pascanu, “Local minima in training of neural networks”, arXiv, 2016
• Itay Safran, Ohad Shamir, “Spurious Local Minima are Common in Two-Layer ReLU Neural Networks”, arXiv, 2017
• Yi Zhou, Yingbin Liang, “Critical Points of Neural Networks: Analytical Forms and Landscape Properties”, arXiv, 2017
• Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah, “Failures of Gradient-Based Deep Learning”, arXiv, 2017
The theory should look like:
Under some conditions (initialization, data, ……), we can find the global optimum.
Conjecture
about Deep Learning
Almost all local minima have a loss very similar to the global optimum, and hence finding a local minimum is good enough.
Analyzing Hessian
• When we meet a critical point, it can be a saddle point or a local minimum.
• Analyzing H: if the network has N parameters, H has N eigenvalues λ₁, λ₂, λ₃, …, λ_N.
We assume each eigenvalue has probability 1/2 of being positive and 1/2 of being negative.
Analyzing Hessian
• If N = 1: the sign pattern is (+) or (−).
  1/2 local minima, 1/2 local maxima; saddle points are almost impossible.
• If N = 2: the sign patterns are (+ +), (− −), (+ −), (− +).
  1/4 local minima, 1/4 local maxima, 1/2 saddle points.
• If N = 10:
  1/1024 local minima, 1/1024 local maxima; almost every critical point is a saddle point.
When a network is very large, it is almost impossible to meet a local minimum.
Saddle points are what you need to worry about.
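Under the stated independence assumption, the counting argument above can be simulated directly: draw N random eigenvalue signs and measure how often they are all positive (local minimum), all negative (local maximum), or mixed (saddle point).

```python
import numpy as np

rng = np.random.default_rng(0)

def critical_point_type_fractions(N, trials=100_000):
    """Assume each of the N eigenvalue signs is +/- with probability 1/2, independently."""
    signs = rng.choice([-1, 1], size=(trials, N))
    minima  = np.mean(np.all(signs > 0, axis=1))   # all eigenvalues positive
    maxima  = np.mean(np.all(signs < 0, axis=1))   # all eigenvalues negative
    saddles = 1.0 - minima - maxima                # mixed signs
    return minima, maxima, saddles

for N in (1, 2, 10):
    print(N, critical_point_type_fractions(N), "expected minima fraction:", 0.5 ** N)
# N = 1 : ~1/2 minima, ~1/2 maxima, no saddles
# N = 2 : ~1/4 minima, ~1/4 maxima, ~1/2 saddles
# N = 10: ~1/1024 minima, so almost every critical point is a saddle point
```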
Error v.s. Eigenvalues
Source of image: http://proceedings.mlr.press/v70/pennington17a/pennington17a.pdf
Above we assumed each eigenvalue of H has probability 1/2 of being negative.
Instead, let an eigenvalue be negative with probability p, where p is related to the error: the larger the error, the larger p.
Guess about Error Surface
https://stats385.github.io/assets/lectures/Understanding_and_improving_deep_learing_with_random_matrix_theory.pdf
(Figure: sketch of the error surface; saddle points at higher loss, local minima that are good enough, and global minima.)
Training Error v.s. Eigenvalues
(Figure: empirical measurements.)
Training Error v.s. Eigenvalues
Portion of positive eigenvalues = “degree of local minima”
(Figure: training error versus 1 − “degree of local minima”, i.e. the portion of negative eigenvalues; the empirical curve is compared with a theoretical prediction involving a 3/2 power.)
Spin Glass v.s. Deep Learning
• Deep learning is the same as the spin glass model with 7 assumptions.
(Figure: spin glass model vs. network.)
More Theory
• If the size of the network is large enough, we can find the global optimum by gradient descent.
• Independent of initialization.
Reference
• Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, Yoshua Bengio, “On the saddle point problem for non-convex optimization”, arXiv, 2014
• Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio, “Identifying and attacking the saddle point problem in high-dimensional non-convex optimization”, NIPS, 2014
• Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun, “The Loss Surfaces of Multilayer Networks”, PMLR, 2015
• Jeffrey Pennington, Yasaman Bahri, “Geometry of Neural Network Loss Surfaces via Random Matrix Theory”, PMLR, 2017
• Benjamin D. Haeffele, Rene Vidal, “Global Optimality in Neural Network Training”, CVPR, 2017
What does the Error Surface look like?
Error Surface
(Figure: loss surface over parameters θ₁ and θ₂.)
Profile
Evaluate the loss along the line from θ₀ through the solution θ*, i.e. from θ₀ up to θ₀ + 2(θ* − θ₀).
Local minima are rare?
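A sketch of how such a profile is computed (the loss function and the two parameter vectors below are placeholders; in practice θ₀ is the random initialization and θ* is the parameter vector found by training):

```python
import numpy as np

def loss(theta):
    # Placeholder loss; in practice this evaluates the training loss of the
    # network with its parameters set to the flattened vector `theta`.
    return np.sum((1.0 - theta[0] * theta[1]) ** 2) + 0.05 * np.sum(theta ** 2)

theta_0    = np.array([0.1, -0.2])    # e.g. the random initialization
theta_star = np.array([1.0,  1.0])    # e.g. the solution found by gradient descent

alphas = np.linspace(0.0, 2.0, 21)    # go past the solution (alpha = 1) up to alpha = 2
profile = [loss(theta_0 + a * (theta_star - theta_0)) for a in alphas]

for a, l in zip(alphas, profile):
    print(f"alpha = {a:4.1f}   loss = {l:.4f}")
# If the profile decreases monotonically from alpha = 0 to alpha = 1,
# no barrier (and no intervening local minimum) was met along this 1-D slice.
```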
Profile
Profile
Two random starting points, two “solutions”: θ₀, θ*
Profile - LSTM
Training Process
• 6-layer CNN on CIFAR-10
• Different initializations / different strategies usually lead to similar loss (there are some exceptions).
(Figure: loss curves for different initializations.)
Training Process
• Different strategies (the same initialization)
8% disagreement
Training Process
When do they part ways? (何時分道揚鑣？)
Different training strategies, different basins.
http://mypaper.pchome.com.tw/ccschoolgeo/post/1311484084
Training Process
• Training strategies make a difference at all stages of training.
Larger basin for Adam
Batch Normalization
Skip Connection
Reference
• Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe, “Qualitatively characterizing neural network optimization problems”, ICLR 2015
• Daniel Jiwoong Im, Michael Tao, Kristin Branson, “An Empirical Analysis of Deep Network Loss Surfaces”, arXiv 2016
• Qianli Liao, Tomaso Poggio, “Theory II: Landscape of the Empirical Risk in Deep Learning”, arXiv 2017
• Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein, “Visualizing the Loss Landscape of Neural Nets”, arXiv 2017
Concluding Remarks
• Deep linear networks are not convex, but all their local minima are global minima.
• There are saddle points which are hard to escape.
• Deep (non-linear) networks do have local minima.
• We need more theory in the future.
• Conjecture:
  • When training a larger network, it is rare to meet local minima.
  • All local minima are almost as good as the global minimum.
• We can try to understand the error surface by visualization.
• The error surface is not as complex as imagined.
