2. Last time ...
[Recap figure: a target function $f^{*}$ to fit with small / medium / large networks, shallow vs. deep. What is the difference? Does the function space eventually cover $f^{*}$?]
Optimization: is it possible to find $f^{*}$ in the function space?
3. Optimization
Network: $f_{\theta}(x)$
Training data: $\{(x^{1}, \hat{y}^{1}), (x^{2}, \hat{y}^{2}), \dots, (x^{N}, \hat{y}^{N})\}$
Loss: $L(\theta) = \sum_{n=1}^{N} l\big(f_{\theta}(x^{n}), \hat{y}^{n}\big)$
$\theta^{*} = \arg\min_{\theta} L(\theta)$
Optimization ≠ Learning
In deep learning, $L(\theta)$ is not convex.
Non-convex optimization is NP-hard in general.
Why, then, can we solve the problem by gradient descent?
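For concreteness, here is a minimal sketch (my own illustration, not code from the lecture) of the recipe above: an empirical loss over training pairs, minimized by plain gradient descent on a tiny linear model with squared error.

```python
import numpy as np

# Hypothetical tiny setup: f_theta(x) = theta^T x with squared-error loss.
# Only a sketch of "theta* = argmin_theta L(theta)" via gradient descent.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # training inputs x^1 ... x^N
y = X @ rng.normal(size=5)             # targets y-hat^1 ... y-hat^N

theta = np.zeros(5)
lr = 0.01
for step in range(1000):
    residual = X @ theta - y           # f_theta(x^n) - y-hat^n
    grad = 2 * X.T @ residual / len(X) # gradient of the mean squared error
    theta -= lr * grad                 # gradient descent update

print("final loss:", np.mean((X @ theta - y) ** 2))
```

On this convex toy problem gradient descent is guaranteed to converge; the rest of the section asks why the same recipe still works when $L(\theta)$ is non-convex.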
4. Loss of Deep Learning is not convex
There are at least exponentially many global minima for a neural network, because permuting the neurons in one layer does not change the loss.
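The permutation argument can be checked directly. A small sketch (my own illustration, with arbitrary layer sizes): permuting the hidden units of a one-hidden-layer ReLU network, together with the matching rows of W1 and columns of W2, leaves the output, and hence the loss, unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(10, 4))   # input -> hidden
W2 = rng.normal(size=(1, 10))   # hidden -> output
x = rng.normal(size=4)

def forward(W1, W2, x):
    return W2 @ np.maximum(W1 @ x, 0.0)   # ReLU hidden layer

perm = rng.permutation(10)
# Permute hidden units: reorder rows of W1 and columns of W2 consistently.
out_original = forward(W1, W2, x)
out_permuted = forward(W1[perm], W2[:, perm], x)
print(np.allclose(out_original, out_permuted))   # True: same function, same loss
```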
7. Hessian Matrix: When Gradient is Zero
Some examples in this part are from:
https://www.math.upenn.edu/~kazdan/312F12/Notes/max-min-notesJan09/max-min.pdf
8. Training stops ...
• People believe training gets stuck because the parameters are near a critical point (a critical point is a point where the gradient is zero).
Local minima? How about saddle points?
http://www.deeplearningbook.org/contents/optimization.html
9. When Gradient is Zero
$L(\theta) = L(\theta_0) + (\theta - \theta_0)^{T} g + \frac{1}{2}(\theta - \theta_0)^{T} H (\theta - \theta_0) + \cdots$
Gradient g is a vector: $g_i = \dfrac{\partial L(\theta_0)}{\partial \theta_i}$
Hessian H is a matrix: $H_{ij} = \dfrac{\partial^{2}}{\partial \theta_i \partial \theta_j} L(\theta_0) = \dfrac{\partial^{2}}{\partial \theta_j \partial \theta_i} L(\theta_0) = H_{ji}$ (symmetric)
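A small numerical sketch of this expansion (illustrative only, using finite differences on a made-up loss rather than any particular autodiff library): compute g and H at $\theta_0$ and compare $L(\theta)$ with its second-order approximation.

```python
import numpy as np

def L(theta):
    # Hypothetical smooth non-convex loss, just for illustration.
    return np.sin(theta[0]) * np.cos(theta[1]) + 0.1 * theta @ theta

def grad_and_hessian(L, theta0, eps=1e-4):
    n = len(theta0)
    g = np.zeros(n)
    H = np.zeros((n, n))
    for i in range(n):
        e_i = eps * np.eye(n)[i]
        g[i] = (L(theta0 + e_i) - L(theta0 - e_i)) / (2 * eps)   # central difference
        for j in range(n):
            e_j = eps * np.eye(n)[j]
            H[i, j] = (L(theta0 + e_i + e_j) - L(theta0 + e_i - e_j)
                       - L(theta0 - e_i + e_j) + L(theta0 - e_i - e_j)) / (4 * eps**2)
    return g, H

theta0 = np.array([0.3, -0.2])
g, H = grad_and_hessian(L, theta0)
d = np.array([0.01, 0.02])                      # small displacement theta - theta0
approx = L(theta0) + d @ g + 0.5 * d @ H @ d    # second-order Taylor approximation
print(L(theta0 + d), approx)                    # nearly equal for small d
```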
13. Hessian
$L(\theta) = L(\theta_0) + (\theta - \theta_0)^{T} g + \frac{1}{2}(\theta - \theta_0)^{T} H (\theta - \theta_0) + \cdots$
Newton's method: consider the point where the gradient of the quadratic approximation is zero, $g + H(\theta - \theta_0) = 0$, i.e. $\theta = \theta_0 - H^{-1} g$.
If $L(\theta)$ is a quadratic function, we obtain the critical point in one step.
What is the problem?
Source of image: https://math.stackexchange.com/questions/609680/newtons-method-intuition
Not suitable for deep learning.
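A sketch of the issue as I read the slide (with an illustrative toy quadratic): the Newton step $\theta = \theta_0 - H^{-1}g$ jumps to the critical point of the quadratic model whatever its type, so near a saddle point it moves toward the saddle; on top of that, forming and inverting an N×N Hessian is impractical when N is in the millions.

```python
import numpy as np

# Illustrative quadratic with one negative-curvature direction (a saddle at the origin).
H = np.diag([2.0, -1.0])
L = lambda th: 0.5 * th @ H @ th
grad = lambda th: H @ th

theta0 = np.array([0.1, 2.0])
theta_newton = theta0 - np.linalg.solve(H, grad(theta0))  # one Newton step
print(theta_newton)                  # [0. 0.]: lands exactly on the critical point
print(L(theta0), L(theta_newton))    # -1.99 -> 0.0: the step *increased* the loss
```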
16. Review: Positive/Negative Definite
• An n×n matrix A is symmetric.
• For every non-zero vector x (x ≠ 0):
  positive definite: $x^{T}Ax > 0$ (all eigenvalues are positive)
  positive semi-definite: $x^{T}Ax \ge 0$ (all eigenvalues are non-negative)
  negative definite: $x^{T}Ax < 0$ (all eigenvalues are negative)
  negative semi-definite: $x^{T}Ax \le 0$ (all eigenvalues are non-positive)
Example: the identity matrix $\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$ is positive definite.
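A quick sketch (standard linear algebra, not tied to any particular framework) of checking these cases through the eigenvalues, matching the review above.

```python
import numpy as np

def classify(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(A)               # eigenvalues of a symmetric matrix
    if np.all(eig > tol):   return "positive definite"
    if np.all(eig >= -tol): return "positive semi-definite"
    if np.all(eig < -tol):  return "negative definite"
    if np.all(eig <= tol):  return "negative semi-definite"
    return "indefinite (some positive, some negative)"

print(classify(np.eye(2)))                    # identity: positive definite
print(classify(np.diag([2.0, -1.0])))         # mixed signs: indefinite (saddle case)
```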
17. Hessian
$L(\theta) \approx L(\theta_0) + \frac{1}{2}(\theta - \theta_0)^{T} H (\theta - \theta_0)$
At a critical point:
• H is positive definite ($v^{T}Hv > 0$; all eigenvalues are positive): around $\theta_0$, $L(\theta) > L(\theta_0)$, so it is a local minimum.
• H is negative definite ($v^{T}Hv < 0$; all eigenvalues are negative): around $\theta_0$, $L(\theta) < L(\theta_0)$, so it is a local maximum.
• Sometimes $v^{T}Hv > 0$ and sometimes $v^{T}Hv < 0$ (eigenvalues of both signs): saddle point. (If H is only semi-definite, $v^{T}Hv \ge 0$ or $v^{T}Hv \le 0$, the second-order test alone is inconclusive.)
18. Hessian
$L(\theta) \approx L(\theta_0) + \frac{1}{2}(\theta - \theta_0)^{T} H (\theta - \theta_0)$   (ignore the 1/2 for simplicity)
At a critical point, if v is an eigenvector of H with eigenvalue $\lambda$:
$v^{T} H v = \lambda v^{T} v = \lambda \lVert v \rVert^{2}$ ($= \lambda$ for a unit vector),
so moving from $\theta_0$ to $\theta_0 + v$ changes the loss according to the sign of $\lambda$.
Because H is an n×n symmetric matrix, its eigenvectors $v_1, v_2, \dots, v_n$ can be chosen to form an orthonormal basis. For $u = c_1 v_1 + c_2 v_2$ (with $v_1$ and $v_2$ orthogonal):
$u^{T} H u = (c_1 v_1 + c_2 v_2)^{T} H (c_1 v_1 + c_2 v_2) = c_1^{2}\lambda_1 + c_2^{2}\lambda_2$
19. Hessian
$L(\theta) \approx L(\theta_0) + \frac{1}{2}(\theta - \theta_0)^{T} H (\theta - \theta_0)$
At a critical point, in the general case, write the displacement in the orthonormal eigenbasis $v_1, v_2, \dots, v_n$ of H:
$u = c_1 v_1 + c_2 v_2 + \cdots + c_n v_n$
$u^{T} H u = c_1^{2}\lambda_1 + c_2^{2}\lambda_2 + \cdots + c_n^{2}\lambda_n$
so the signs of the eigenvalues determine whether the loss increases or decreases along each direction.
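A small numerical check of this decomposition (my own sketch, with a random symmetric matrix standing in for H): write u in the eigenbasis and verify $u^{T}Hu = \sum_i c_i^{2}\lambda_i$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
H = (A + A.T) / 2                      # a symmetric "Hessian"

lams, V = np.linalg.eigh(H)            # columns of V: orthonormal eigenvectors v_i
u = rng.normal(size=4)                 # an arbitrary displacement theta - theta0
c = V.T @ u                            # coefficients c_i of u in the eigenbasis

print(u @ H @ u, np.sum(c**2 * lams))  # equal: u^T H u = sum_i c_i^2 * lambda_i
```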
26. Training stuck ≠ Zero Gradient
• People believe training gets stuck because the parameters are around a critical point, but that is not necessarily what happens in practice.
http://www.deeplearningbook.org/contents/optimization.html
27. Training stuck ≠ Zero Gradient
http://videolectures.net/deeplearning2015_bengio_theoretical_motivations/
Training may approach a saddle point and then escape from it.
34. Deep Linear Network
$y = W_K W_{K-1} \cdots W_2 W_1 x$   (no activation functions)
$L = \sum_{n=1}^{N} \lVert \hat{y}^{n} - y^{n} \rVert^{2}$
• Hidden layer size ≥ input dim, output dim.
• More than two hidden layers can produce saddle points without negative eigenvalues.
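A minimal sketch of such a network (my own illustration with arbitrary sizes and a squared loss): the model is a product of weight matrices, so the composed map is linear in x even though the loss is non-convex in the individual $W_k$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))                  # inputs
Y = X @ rng.normal(size=(6, 3))                # targets from some linear map

# K = 3 weight matrices; hidden size >= input/output dims, as on the slide.
Ws = [rng.normal(size=(6, 6)) * 0.3,
      rng.normal(size=(6, 6)) * 0.3,
      rng.normal(size=(6, 3)) * 0.3]

def loss(Ws):
    out = X
    for W in Ws:                               # composition of linear maps (row-vector convention)
        out = out @ W
    return np.mean(np.sum((out - Y) ** 2, axis=1))

print(loss(Ws))
```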
35. Reference
• Kenji Kawaguchi, "Deep Learning without Poor Local Minima", NIPS, 2016
• Haihao Lu, Kenji Kawaguchi, "Depth Creates No Bad Local Minima", arXiv, 2017
• Thomas Laurent, James von Brecht, "Deep Linear Neural Networks with Arbitrary Loss: All Local Minima Are Global", arXiv, 2017
• Maher Nouiehed, Meisam Razaviyayn, "Learning Deep Models: Critical Points and Local Openness", arXiv, 2018
36. Non-linear Deep Network
Does it have local minima?
It is hard to prove that something does not exist; it is relatively easy to prove that it does.
(Thanks to the classmate who spotted an error on the slides.)
39. "Blind Spot" of ReLU
[Figure: a ReLU network from x to y in which every unit outputs 0, so the gradient is zero.]
It is pretty easy to make this happen ...
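A tiny sketch of the blind spot (my own illustration): if a ReLU unit's pre-activation is negative for every training example, its output is zero and so is its gradient, so gradient descent can never move it.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))

# Hypothetical ReLU unit with a very negative bias: pre-activation < 0 for all inputs.
w, b = rng.normal(size=3), -100.0
pre = X @ w + b
out = np.maximum(pre, 0.0)                                    # always 0
grad_w = ((pre > 0).astype(float)[:, None] * X).mean(axis=0)  # dReLU/dpre is 0 everywhere
print(out.max(), grad_w)                                      # 0.0 and a zero gradient: the unit is "dead"
```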
40. "Blind Spot" of ReLU
• MNIST, Adam, 1M updates.
Consider your initialization.
41. Considering Data
[Figure: a label generator ("teacher") with k ReLU neurons and weights $w_1, \dots, w_k$, output weights all 1, and a network to be trained ("student") with n neurons and weights $\theta_1, \dots, \theta_n$; the inputs are drawn from N(0,1) and the label generator's outputs are used as labels.]
If $n \ge k$ and $\theta_i = w_i$, we obtain a global minimum.
The numbers k and n matter.
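A sketch of the setup as I read the slide (a hypothetical teacher/student construction, not the exact experiment): the label generator with k ReLU neurons produces targets from N(0,1) inputs, and a student with n ≥ k neurons that copies the teacher's weights, leaving the extra units at zero, attains zero loss.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, n = 5, 3, 4                      # input dim, teacher width, student width (n >= k)
W_teacher = rng.normal(size=(k, d))    # teacher weights w_1, ..., w_k
X = rng.normal(size=(1000, d))         # data x ~ N(0, 1)

def net(W, X):
    return np.maximum(X @ W.T, 0.0).sum(axis=1)   # sum of ReLU units (output weights = 1)

y = net(W_teacher, X)                  # labels from the label generator

# Student that copies the teacher and leaves the remaining n - k units at zero.
W_student = np.zeros((n, d))
W_student[:k] = W_teacher
print(np.mean((net(W_student, X) - y) ** 2))      # 0.0: a global minimum
```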
44. Reference
• Grzegorz Swirszcz, Wojciech Marian Czarnecki, Razvan Pascanu, "Local Minima in Training of Neural Networks", arXiv, 2016
• Itay Safran, Ohad Shamir, "Spurious Local Minima are Common in Two-Layer ReLU Neural Networks", arXiv, 2017
• Yi Zhou, Yingbin Liang, "Critical Points of Neural Networks: Analytical Forms and Landscape Properties", arXiv, 2017
• Shai Shalev-Shwartz, Ohad Shamir, Shaked Shammah, "Failures of Gradient-Based Deep Learning", arXiv, 2017
The theory should look like: "Under some conditions (initialization, data, ...), we can find the global optimum."
45. Conjecture about Deep Learning
Almost all local minima have a loss very similar to the global optimum, and hence finding a local minimum is good enough.
46. Analyzing Hessian
• When we meet a critical point, it can be a saddle point or a local minimum.
• Analyze H: if the network has N parameters, H has N eigenvalues $\lambda_1, \lambda_2, \lambda_3, \dots, \lambda_N$.
• Assume each eigenvalue has probability 1/2 of being positive and 1/2 of being negative.
47. Analyzing Hessian
• If N = 1: 1/2 local minima (+), 1/2 local maxima (-); a saddle point is almost impossible.
• If N = 2: 1/4 local minima (++), 1/4 local maxima (--), 1/2 saddle points (+-, -+).
• If N = 10: 1/1024 local minima, 1/1024 local maxima; almost every critical point is a saddle point.
When a network is very large, it is almost impossible for a critical point to be a local minimum; saddle points are what you need to worry about.
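The counting above can be written out directly: under the independent-sign assumption, P(all N eigenvalues positive) = $(1/2)^N$, and a quick simulation (my own sketch) reproduces it.

```python
import numpy as np

rng = np.random.default_rng(6)
for N in (1, 2, 10):
    signs = rng.choice([-1, 1], size=(100_000, N))      # random sign per eigenvalue
    p_min = (signs > 0).all(axis=1).mean()              # fraction that are local minima
    print(N, p_min, 0.5 ** N)   # e.g. N=10: about 1/1024; the rest are saddles/maxima
```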
48. Error vs. Eigenvalues
Source of image: http://proceedings.mlr.press/v70/pennington17a/pennington17a.pdf
Instead of 1/2, assume each eigenvalue has probability p of being negative, where p is related to the error: the larger the error, the larger p.
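Under this modified assumption (each eigenvalue negative with probability p instead of 1/2), the same counting as on slide 47 gives, as a rough sketch:

```latex
P(\text{local minimum}) = (1-p)^{N}, \quad
P(\text{local maximum}) = p^{N}, \quad
P(\text{saddle point}) = 1 - p^{N} - (1-p)^{N}
```

Since larger error means larger p, high-error critical points are overwhelmingly saddle points, while the critical points that are local minima tend to sit at low error, which is the picture guessed on the next slide.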
49. Guess about Error Surface
https://stats385.github.io/assets/lectures/Understanding_and_improving_deep_learing_with_random_matrix_theory.pdf
[Figure: a sketch of the error surface labeling global minima, local minima (good enough), and saddle points.]
51. Training Error vs. Eigenvalues
[Figure: training error versus 1 - "degree of local minima", i.e. the portion of negative eigenvalues; the "degree of local minima" is the portion of positive eigenvalues.]
53. Spin Glass vs. Deep Learning
• Deep learning is the same as the spin-glass model under 7 assumptions.
[Figure: correspondence between the spin-glass model and a network.]
54. More Theory
• If the size of the network is large enough, we can find the global optimum by gradient descent, independent of the initialization.
55. Reference
• Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, Yoshua Bengio, "On the Saddle Point Problem for Non-convex Optimization", arXiv, 2014
• Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, Yoshua Bengio, "Identifying and Attacking the Saddle Point Problem in High-dimensional Non-convex Optimization", NIPS, 2014
• Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun, "The Loss Surfaces of Multilayer Networks", PMLR, 2015
• Jeffrey Pennington, Yasaman Bahri, "Geometry of Neural Network Loss Surfaces via Random Matrix Theory", PMLR, 2017
• Benjamin D. Haeffele, Rene Vidal, "Global Optimality in Neural Network Training", CVPR, 2017
64. Training Process
[Figure: training loss curves of a 6-layer CNN on CIFAR-10 with different initializations.]
Different initializations / different strategies usually lead to similar loss (there are some exceptions).
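One common way to look at a slice of the error surface (a hedged sketch in the spirit of the visualization papers referenced on the next slide, not necessarily the method behind the figure above) is to plot the loss along the straight line between two parameter vectors, for example the initialization and the trained solution.

```python
import numpy as np

def interpolate_loss(loss_fn, theta_a, theta_b, num=21):
    """Evaluate the loss along the straight line (1 - t) * theta_a + t * theta_b."""
    ts = np.linspace(0.0, 1.0, num)
    return ts, [loss_fn((1 - t) * theta_a + t * theta_b) for t in ts]

# Toy demonstration with a quadratic loss; in practice theta_a / theta_b would be
# flattened parameter vectors of a real network (e.g. initialization vs. solution).
quad = lambda th: float(th @ th)
ts, losses = interpolate_loss(quad, np.ones(4), -np.ones(4))
print(list(zip(ts[:3], losses[:3])))
```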
72. Reference
• Ian J. Goodfellow, Oriol Vinyals, Andrew M. Saxe, "Qualitatively Characterizing Neural Network Optimization Problems", ICLR, 2015
• Daniel Jiwoong Im, Michael Tao, Kristin Branson, "An Empirical Analysis of Deep Network Loss Surfaces", arXiv, 2016
• Qianli Liao, Tomaso Poggio, "Theory II: Landscape of the Empirical Risk in Deep Learning", arXiv, 2017
• Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein, "Visualizing the Loss Landscape of Neural Nets", arXiv, 2017
74. Concluding Remarks
• A deep linear network is not convex, but all of its local minima are global minima.
  • There are saddle points that are hard to escape.
• A deep (non-linear) network does have local minima.
  • We need more theory in the future.
• Conjecture:
  • When training a larger network, it is rare to meet a local minimum.
  • All local minima are almost as good as the global one.
• We can try to understand the error surface by visualization.
  • The error surface is not as complex as imagined.