Tutorial:
Sparse Variational Dropout
Wu Hyun Shin
MLAI, KAIST
July 24, 2019
Before We Start
• This tutorial builds up to (Sparse) Variational Dropout step by step: Bayesian neural networks / variational inference + the familiar dropout layer.
• The BNN / variational-inference background is covered only as far as the dropout story needs it, so some details are simplified.
• Notation loosely follows the original papers; questions and corrections are welcome at any point.
Papers Covered
Binary Dropout (BD)
• Improving neural networks by preventing co-adaptation of feature detectors. Hinton et al. arXiv:1207.0580, 2012. (4002 citations)
• Dropout: a simple way to prevent neural networks from overfitting. Srivastava et al. JMLR 2014. (13126 citations)
Gaussian Dropout (GD)
• Fast dropout training. Wang et al. ICML 2013. (249 citations)
Variational Dropout (VD)
• Variational Dropout and the Local Reparameterization Trick. Kingma et al. NIPS 2015. (326 citations)
Sparse Variational Dropout (Sparse VD) ← the final goal!
• Variational Dropout Sparsifies Deep Neural Networks. Molchanov et al. ICML 2017. (148 citations)
→ The earlier papers provide the building blocks; the final paper combines and extends them.
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
From BD to GD
• What does (binary) dropout do?
  • retain each unit with probability p, drop it with probability 1-p;
  • equivalently, multiply each unit by a Bernoulli random variable ξ_i ∈ {0, 1}.
• Multiplicative Bernoulli noise:
  x̂_i = x_i · ξ_i,  ξ_i ~ Bernoulli(p)
• This is Binary Dropout.
From BD to GD
• Training time vs. test time.
• In common implementations (e.g. PyTorch), the retained units are rescaled by 1/p during training ("inverted dropout"), so nothing has to change at test time.
• The multiplicative noise then becomes
  ξ_i = 1/p with probability p,  ξ_i = 0 with probability 1-p,
  applied to each unit (and hence to the weights w_i it multiplies).
From BD to GD
• Let's compute the mean and variance of this scaled Bernoulli noise ξ_i.
• Mean:
  E[ξ_i] = (1/p) · Pr(ξ_i = 1/p) + 0 · Pr(ξ_i = 0) = (1/p) · p + 0 · (1-p) = 1
• Second moment:
  E[ξ_i²] = (1/p)² · p + 0² · (1-p) = 1/p
• Variance:
  Var[ξ_i] = E[ξ_i²] - (E[ξ_i])² = 1/p - 1² = (1-p)/p
From BD to GD
• Now replace the Bernoulli noise with a Gaussian random variable that matches its mean and variance:
  μ = 1,  σ² = (1-p)/p,  so  ξ_i ~ N(μ, σ²) = N(1, (1-p)/p)
• Writing α = (1-p)/p, this becomes ξ_i ~ N(1, α): the first two moments of binary dropout are preserved!
• Multiplicative Gaussian noise (a minimal code sketch follows):
  x̂_i = x_i · ξ_i,  ξ_i ~ N(1, α),  α = (1-p)/p
• This is Gaussian Dropout.
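To make the two noise models concrete, here is a minimal PyTorch-style sketch (my own illustration, not code from the slides or papers); the module names and the convention that p is the retain probability are assumptions carried over from the text above.

    import torch
    import torch.nn as nn

    class BinaryDropout(nn.Module):
        """Multiplicative Bernoulli noise: keep a unit with prob p, scale kept units by 1/p."""
        def __init__(self, p=0.5):
            super().__init__()
            self.p = p  # retain probability, as in the slides

        def forward(self, x):
            if not self.training:
                return x  # inverted dropout: no rescaling needed at test time
            mask = torch.bernoulli(torch.full_like(x, self.p))
            return x * mask / self.p

    class GaussianDropout(nn.Module):
        """Multiplicative Gaussian noise xi ~ N(1, alpha) with alpha = (1 - p) / p."""
        def __init__(self, p=0.5):
            super().__init__()
            self.alpha = (1.0 - p) / p

        def forward(self, x):
            if not self.training:
                return x  # the noise has mean 1, so the expected output is unchanged
            noise = 1.0 + self.alpha ** 0.5 * torch.randn_like(x)
            return x * noise

Both layers leave the expected activation unchanged; they differ only in the shape of the multiplicative noise.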
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
Recap: Bayesian Neural Networks
• What is a BNN?
  • Each weight is treated as a random variable with a distribution, not a point estimate.
• How is it trained?
  • Through Bayes' theorem:
    posterior = likelihood × prior / evidence,  i.e.  p(w|D) = p(D|w) · p(w) / p(D)
  • The posterior is what we are after: it captures the uncertainty over the weights.
• The problem:
  • The evidence p(D) is intractable for neural networks, so the exact posterior cannot be computed.
• A practical alternative: Variational Inference.
Recap: Variational Inference
• What is Variational Inference?
  • A way to approximate the intractable posterior p(w|D).
• The idea:
  • Choose a tractable family of distributions over the weights: q_φ(w).
  • Make q_φ(w) as close to p(w|D) as possible.
• How is closeness measured?
  • With the KL divergence, optimized over the variational parameters:
    φ* = argmin_φ KL[ q_φ(w) || p(w|D) ]
• Inference becomes an optimization problem.
(q_φ(w): variational distribution / φ: variational parameter / p(w|D): posterior distribution)
Recap: Variational Inference
• The objective can be rewritten as
  KL[ q_φ(w) || p(w|D) ] + L(φ) = log p(D)
• Since log p(D) is a constant, minimizing the KL is equivalent to maximizing L(φ).
• What is L(φ)?
  L(φ) = E_{q_φ(w)}[ log p(D|w) ] - KL[ q_φ(w) || p(w) ]
• Two terms: expected log-likelihood + KL regularization toward the prior.
• Since L(φ) ≤ log p(D), it is called the ELBO (Evidence Lower BOund).
• New objective: maximize the ELBO.
Recap: Variational Inference
• Let's make this concrete with a simple example.
• Setting: a classification problem, a fully-connected network with a single layer.
• Prior: before seeing any data, assume each weight follows a zero-mean Gaussian N(0, ·).
• Variational posterior: model each weight as a Gaussian and fit its parameters to the training data.
• During training, each weight distribution should
  • stay close to the prior (min KL term), and
  • explain the observed data well (min NLL term).
Recap: Variational Inference
[Diagram: a fully-connected layer computing B = A · W, with input activations A, weight matrix W (I × O), and outputs B.]
• The variational posterior q_φ(w) for each weight:
  • modeled as a Gaussian: w ~ N(μ, σ²);
  • μ and σ are the learnable parameters (trained by back-propagation).
• Remaining question: how do we back-propagate through the sampling of w?
Recap: Variational Inference
[Diagram: the same layer, now with the weights sampled as W = μ + σ ⊙ ε.]
• The variational posterior for each weight is a Gaussian w ~ N(μ, σ²) with learnable μ, σ.
• How do we back-propagate through the sampling? The Reparameterization Trick (RT):
  w ~ q_φ(w) = N(μ, σ²)
  ⇒ w = f(φ, ε),  ε ~ p(ε)
  ⇒ w = μ + σ · ε,  ε ~ N(0, 1)
• The randomness is moved into ε, so gradients can flow to μ and σ.
Recap: Variational Inference
[Diagram: the same layer with W = μ + σ ⊙ ε, ε ~ N(0, 1).]
• How do we actually train this?
• For deep networks the ELBO has to be optimized with minibatch-based training.
• With φ = {μ, σ}:
  argmax_φ  E_{q_φ(w)}[ log p(D|w) ] - KL[ q_φ(w) || p(w) ]
  ≈ argmax_φ  (N/M) Σ_{m=1}^{M} log p(y_m | x_m, w = f(φ, ε_m)) - KL[ q_φ(w) || p(w) ]
• First term: a minibatch-based Monte Carlo approximation, sampled through the RT.
• Second term: usually computed analytically (closed form).
• This estimator is called Stochastic Gradient Variational Bayes (SGVB); a minimal sketch follows.
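A minimal sketch (my own, not code from the papers) of how this objective is typically assembled for a single Bayesian linear layer with a fully-factorized Gaussian posterior N(μ, σ²) and a standard-normal prior; the class and function names are illustrative, and the closed-form KL used here is the standard Gaussian-to-Gaussian expression.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BayesLinear(nn.Module):
        def __init__(self, n_in, n_out):
            super().__init__()
            self.mu = nn.Parameter(0.01 * torch.randn(n_out, n_in))
            self.log_sigma = nn.Parameter(torch.full((n_out, n_in), -3.0))

        def forward(self, x):
            # Reparameterization trick: w = mu + sigma * eps, eps ~ N(0, 1)
            sigma = self.log_sigma.exp()
            w = self.mu + sigma * torch.randn_like(sigma)
            return F.linear(x, w)

        def kl(self):
            # KL[ N(mu, sigma^2) || N(0, 1) ], summed over all weights
            sigma2 = (2 * self.log_sigma).exp()
            return 0.5 * (sigma2 + self.mu ** 2 - 1 - 2 * self.log_sigma).sum()

    def sgvb_loss(layer, x, y, dataset_size):
        # Negative ELBO on a minibatch: -(N/M) * sum_m log p(y_m | x_m, w) + KL
        logits = layer(x)
        nll = F.cross_entropy(logits, y, reduction="mean")  # (1/M) * sum_m -log p(y_m | ...)
        return dataset_size * nll + layer.kl()

Minimizing sgvb_loss is the same as maximizing the minibatch ELBO above; one weight sample per step is drawn inside forward().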
Recap: Variational Inference
[Diagram: the same layer with W = μ + σ ⊙ ε.]
• Reading the objective:
  argmax_φ  (N/M) Σ_{m=1}^{M} log p(y_m | x_m, w = f(φ, ε_m)) - KL[ q_φ(w) || p(w) ]
• First term: essentially the usual non-Bayesian training loss,
  • except that the weights now carry randomness (they are sampled from q_φ).
• Second term: the KL divergence from the prior N(0, ·),
  • it regularizes the weight distributions so they do not drift too far from the prior.
Recap: Variational Inference
• One more thing before moving on.
• Gradient variance in SGVB:
  • because of the randomness, the gradient estimates themselves have variance;
  • sources: the data distribution p(D) (minibatch sampling) and the noise distribution p(ε);
  • if this variance is too large, training becomes slow and unstable.
• The second term (the KL term) is usually not a problem: it can often be computed in closed form.
• The first term relies on sampling, so it is the main source of gradient variance.
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
VD: Variational Dropout
• Two main contributions:
  • Part 1: a way to make SGVB more efficient and scalable, the Local Reparameterization Trick (LRT), which reduces gradient variance and speeds up training.
  • Part 2: a reinterpretation of dropout as a variational method: GD + variational method + LRT = Variational Dropout.
    • Advantage: essentially the same computational cost as GD (with the LRT).
    • Advantage: the dropout rate becomes a learnable parameter.
    • Open question: which prior turns GD into a proper Bayesian network?
VD-Part 1: Local Reparameterization Trick
• First, the Local Reparameterization Trick (LRT).
• Goal: make SGVB more efficient by reducing its gradient variance.
• Where does the gradient variance come from? Let's start from a decomposition of the variance.
VD-Part 1: Local Reparameterization Trick
• Recall SGVB.
• ELBO:  E_{q_φ(w)}[ Σ_{(x,y)∈D} log p(y|x, w) ] - KL[ q_φ(w) || p(w) ]
• Assume the KL term can be handled in closed form, and focus on the first term.
• Minibatch approximation:
  E_{q_φ(w)}[ Σ_{(x,y)∈D} log p(y|x, w) ] ≈ (N/M) Σ_{m=1}^{M} log p(y_m | x_m, w = f(φ, ε_m)) =: (N/M) Σ_{m=1}^{M} L_m
• So the SGVB estimator is (N/M) Σ_{m=1}^{M} L_m, where L_m is the sampled log-likelihood of one minibatch example.
(M: minibatch size / N: dataset size)
VD-Part 1: Local Reparameterization Trick
• What is the variance of the estimator (N/M) Σ_{m=1}^{M} L_m?
• It splits into a variance part and a covariance part (see the decomposition below).
• The variance part shrinks as the minibatch size M grows.
• The covariance part, however, does not go away with a larger minibatch.
• The condition we want is therefore
  Cov[ L_i, L_j ] = 0  for i ≠ j.
• In words: within a minibatch, the sampled log-likelihoods of different examples should be uncorrelated.
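For reference, the decomposition the slide points to can be written out as follows (assuming the L_m within a minibatch are identically distributed; this restatement is mine):

    \operatorname{Var}\!\Big[\frac{N}{M}\sum_{m=1}^{M} L_m\Big]
      = \frac{N^2}{M^2}\Big(\sum_{m}\operatorname{Var}[L_m]
          + \sum_{m \neq m'}\operatorname{Cov}[L_m, L_{m'}]\Big)
      = N^2\Big(\frac{1}{M}\operatorname{Var}[L_m]
          + \frac{M-1}{M}\operatorname{Cov}[L_m, L_{m'}]\Big)

The first term decays as 1/M, while the covariance term approaches a constant as M grows, which is why Cov[L_m, L_{m'}] = 0 is the property to aim for.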
VD-Part 1: Local Reparameterization Trick
• When is this condition violated?
[Diagram: B = A · W with a single sampled weight matrix W = μ + σ ⊙ ε shared by the whole minibatch.]
• Naive approach: draw one weight-matrix sample per minibatch (one ε per weight).
• Every example in the minibatch then flows through the same sampled weights, so every output depends on the same ε ~ N(0, 1).
• Because the examples share the noise, their log-likelihoods become dependent:
  Cov[ L_i, L_j ] ≠ 0.
VD-Part 1: Local Reparameterization Trick
• How can the condition be satisfied?
[Diagram: B = A · W with a separate sampled weight matrix per example.]
• Straightforward fix: draw a separate weight-matrix sample for every example in the minibatch.
• Each output row then depends only on its own noise, so the dependency between examples disappears:
  Cov[ L_i, L_j ] = 0.
• Problem: this is far too expensive (M full weight samples per layer, per step), in both memory and computation.
• Is there a smarter way?
VD-Part 1: Local Reparameterization Trick
• Key observation: look at a single entry of the output,
  b_{m,j} = Σ_i a_{m,i} · w_{i,j}.
• If the w_{i,j} are Gaussian, then b_{m,j} is also Gaussian:
  if X and Y are independent and normally distributed, X + Y is also normally distributed.
(Notation: a_{m,i} are the input activations, w_{i,j} the weights, b_{m,j} the outputs; m indexes minibatch examples.)
VD-Part 1: Local Reparameterization Trick
• So we can sample B directly in the output space, without ever sampling the weights. This is the LRT!
• With q(w_{i,j}) = N(μ_{i,j}, σ²_{i,j}):
  b_{m,j} = γ_{m,j} + sqrt(δ_{m,j}) · ζ_{m,j},  ζ_{m,j} ~ N(0, 1),
  where  γ_{m,j} = Σ_i a_{m,i} μ_{i,j}  and  δ_{m,j} = Σ_i a²_{m,i} σ²_{i,j}
  (squares and square roots are taken element-wise).
• Weight noise has been turned into activation noise; global noise has become local noise.
VD-Part 1: Local Reparameterization Trick
• What do we gain? (a code sketch follows)
• Each example now gets its own noise ζ, so Cov[ L_i, L_j ] = 0 and the gradient variance drops!
• Faster convergence in terms of optimization steps: lower-variance gradients from a single sample per step.
• Faster in terms of wall-clock time as well: we draw one noise value per activation instead of per weight (and per example).
[Diagram: global noise = weight noise vs. local noise = activation/unit noise.]
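As a concrete illustration (my own sketch, not the authors' code), here is the difference between naive weight-noise sampling and the LRT for a fully-connected layer with posterior N(μ, σ²) over the weights; the parameter names and shapes are assumptions.

    import torch
    import torch.nn.functional as F

    def forward_weight_noise(a, mu, log_sigma):
        # Naive SGVB: one sampled weight matrix shared by the whole minibatch.
        sigma = log_sigma.exp()
        w = mu + sigma * torch.randn_like(sigma)
        return F.linear(a, w)

    def forward_lrt(a, mu, log_sigma):
        # Local Reparameterization Trick: sample the pre-activations directly,
        #   b_{m,j} ~ N( sum_i a_{m,i} mu_{i,j} ,  sum_i a_{m,i}^2 sigma_{i,j}^2 )
        gamma = F.linear(a, mu)                          # output means
        delta = F.linear(a ** 2, (2 * log_sigma).exp())  # output variances
        return gamma + delta.clamp(min=1e-8).sqrt() * torch.randn_like(gamma)

Both functions sample from the same per-entry distribution of b, but forward_lrt gives every row of the minibatch its own noise while drawing only a (batch × output) noise tensor.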
VD-Part 2
• So far: a trick that reduces the gradient variance of SGVB, the LRT.
• Next: reinterpret dropout as a variational method.
• Result: Variational Dropout (with the LRT).
VD-Part 2: Reinterpretation of GD as VD
• Dropout as a variational method: compare the two views of the same layer (the moment matching is spelled out below).

Gaussian dropout
• Multiplicative noise on the units: B = (A ∘ ξ) θ,  ξ_{m,i} ~ N(1, α)
• With the LRT:
  b_{m,j} = Σ_i (a_{m,i} ξ_{m,i}) θ_{i,j}
  E[b_{m,j}] = Σ_i a_{m,i} θ_{i,j}
  Var[b_{m,j}] = α Σ_i a²_{m,i} θ²_{i,j}

Variational Bayesian inference
• Noise on the weights: B = A W,  w_{i,j} ~ N(θ_{i,j}, α θ²_{i,j})  (mean θ, multiplicative noise α)
• With the LRT:
  b_{m,j} = Σ_i a_{m,i} w_{i,j}
  E[b_{m,j}] = Σ_i a_{m,i} θ_{i,j}
  Var[b_{m,j}] = α Σ_i a²_{m,i} θ²_{i,j}

• If the posterior is chosen as w_{i,j} ~ N(θ_{i,j}, α θ²_{i,j}), the two views give the same distribution over B (details in Appendix B of the paper).
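The moment matching behind this comparison can be written out per output entry (my restatement of the argument):

    \xi_{m,i} \sim \mathcal{N}(1, \alpha)
      \;\Rightarrow\; \theta_{i,j}\,\xi_{m,i} \sim \mathcal{N}\big(\theta_{i,j},\, \alpha\,\theta_{i,j}^{2}\big),
    \qquad
    b_{m,j} = \sum_i a_{m,i}\,\theta_{i,j}\,\xi_{m,i}
      \;\sim\; \mathcal{N}\Big(\sum_i a_{m,i}\theta_{i,j},\;\; \alpha \sum_i a_{m,i}^{2}\theta_{i,j}^{2}\Big)

which is exactly the marginal obtained by drawing w_{i,j} ~ N(θ_{i,j}, α θ_{i,j}²) independently and computing b_{m,j} = Σ_i a_{m,i} w_{i,j}.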
VD-Part 2: Reinterpretation of GD as VD
• Training Gaussian dropout as a variational method gives Variational Dropout (almost).
• Advantages:
  • with the LRT, the computational cost is essentially the same as Gaussian dropout;
  • the dropout rate α becomes a variational parameter and can be learned:
    min_φ KL[ q_φ(w) || p(w|D) ],  with φ = {θ, α}.
• One question remains: what should the prior be?
• (Side note: binary dropout → Gaussian dropout → Variational Dropout; the first step, approximating binary dropout by Gaussian dropout, rests on the central limit theorem. See: Fast dropout training. Wang et al. ICML 2013.)
VD-Part 2: Reinterpretation of GD as VD
• What should the prior be?
• Requirement: consistency with Gaussian dropout (this is the subtle part).
  • In Gaussian dropout, the dropout rate α is fixed and only the weights θ are trained, using just the expected log-likelihood term.
  • With w ~ N(θ, α θ²), the variational objective is
    max_θ  E_{q(w|θ,α)}[ Σ_{(x,y)} log p(y|x, w) ] - KL[ q(w|θ, α) || p(w) ]
  • So when α is fixed, the KL term has to be independent of θ (no effect on the optimization of θ).
• The prior that satisfies this requirement: the log-uniform prior.
VD-Part 2: Reinterpretation of GD as VD
• The shape of the log-uniform distribution: p(log |w|) ∝ const, i.e. p(|w|) ∝ 1/|w|.
[Plot: density of the log-uniform prior, concentrated near zero on the |w| axis.]
• Most of the density sits near zero, so the prior pushes weights toward zero: it encourages sparsity.
• *Interpretation via MDL (Minimum Description Length): if weights are stored in a floating-point format, a log-uniform prior roughly corresponds to paying a cost per significant digit rather than per unit of magnitude, so cheaply describable weights are preferred. (Roughly speaking.)
VD-Part 2: Reinterpretation of GD as VD
• Can the negative KL term be computed in closed form?
  max_φ  E_{q_φ(w)}[ Σ log p(y|x, w) ] - KL[ q_φ(w) || p(w) ]
• Following Appendix C of the paper, with the log-uniform prior the KL term depends only on α:
  • it is independent of θ, exactly as required;
  • but it is analytically intractable, so it has to be approximated.
VD-Part 2: Reinterpretation of GD as VD
• The intractable KL term is therefore approximated in the original VD paper:
  • (1) a third-order polynomial approximation in α;
  • (2) an alternative lower bound;
  • both are accurate as α → 0.
• Limitation: the approximation is only used for α ≤ 1, i.e. retain probability p ≥ 0.5 (α = (1-p)/p).
• For larger α, training also suffers from large gradient variance and tends to end up in poor local minima.
• Note: log α = 0 (α = 1), where the KL equals 0 up to the constant C, already corresponds to dropping half of the units.
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
Sparse VD
• What does Sparse VD change and add relative to VD?
• Per-weight parameters: each weight gets its own θ_{ij} and α_{ij} (an individual dropout rate per weight).
• (1) Additive Noise Reparameterization:
  • a further trick to reduce the gradient variance.
• (2) A new approximation of the KL divergence:
  • accurate over the whole range of α (e.g. α ≥ 1 as well).
• α → ∞: the weight is effectively dropped / small α: the weight is kept.
• Net effect:
  • training drives many α to infinity, yielding a very sparse network;
  • this can be viewed as Bayesian pruning.
Sparse VD: Additive Noise Reparameterization
• The remaining problem in VD:
  • when the dropout rate α is large, the gradient with respect to θ has large variance, because the multiplicative noise appears in it.
• The solution: a change of variables.
  • Rewrite the multiplicative parameterization additively:
    w_{ij} = θ_{ij} (1 + sqrt(α_{ij}) ε_{ij}) = θ_{ij} + σ_{ij} ε_{ij},  ε_{ij} ~ N(0, 1),
    where σ²_{ij} = α_{ij} θ²_{ij} is treated as a new, independent variable (optimized as log σ, i.e. w = θ + exp(log σ) · ε).
  • The distribution over w, and hence over the layer output, is exactly the same.
  • But the gradient changes: ∂w_{ij}/∂θ_{ij} = 1 + sqrt(α_{ij}) ε_{ij} becomes ∂w_{ij}/∂θ_{ij} = 1; the noise no longer multiplies θ.
• Same model, greatly reduced gradient variance! (a layer sketch follows)
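A sketch of what this parameterization looks like in a layer (my own illustration; the real implementations linked at the end do the same with extra numerical safeguards). The layer stores θ and log σ, recovers log α only when needed, and samples the output with the LRT in the additive form.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LinearSparseVD(nn.Module):
        def __init__(self, n_in, n_out):
            super().__init__()
            self.theta = nn.Parameter(0.01 * torch.randn(n_out, n_in))
            self.log_sigma = nn.Parameter(torch.full((n_out, n_in), -5.0))

        def log_alpha(self):
            # alpha = sigma^2 / theta^2  (the additive reparameterization in reverse)
            return 2 * self.log_sigma - 2 * torch.log(self.theta.abs() + 1e-8)

        def forward(self, x):
            if not self.training:
                return F.linear(x, self.theta)
            # LRT with w = theta + sigma * eps: sample the outputs directly
            mean = F.linear(x, self.theta)
            var = F.linear(x ** 2, (2 * self.log_sigma).exp())
            return mean + var.clamp(min=1e-8).sqrt() * torch.randn_like(mean)

In the sampled forward pass, θ only enters through the output mean, so its gradient does not carry the noise term.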
Sparse VD: Approximation of the KL term
• The KL term (which depends only on α) still has to be approximated, now over the full range of α.
• The paper proposes an accurate heuristic approximation (a code sketch follows):
  -KL ≈ k1 · sigmoid(k2 + k3 · log α) - 0.5 · log(1 + α⁻¹) + C
  • the -0.5 · log(1 + α⁻¹) term captures the exact asymptotic behaviour;
  • the remaining part is fitted with a scaled and shifted sigmoid to sampled values of the true KL.
• For comparison, the approximations used in the original VD paper (valid only for small α):
  approx. 1: 0.5 log α + c1 α + c2 α² + c3 α³
  approx. 2: 0.5 log α
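In practice this approximation is implemented directly from the constants reported in Molchanov et al. (2017); a sketch that pairs with the layer above (the function name is mine, the constants are the paper's):

    import torch

    def neg_kl_approx(log_alpha):
        # -KL[ q(w | theta, alpha) || log-uniform prior ], approximated per weight:
        #   k1 * sigmoid(k2 + k3 * log_alpha) - 0.5 * log(1 + 1/alpha) + C
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        C = -k1
        return (k1 * torch.sigmoid(k2 + k3 * log_alpha)
                - 0.5 * torch.log1p(torch.exp(-log_alpha)) + C)

    # Minibatch objective (N = dataset size, M = minibatch size), cf. the SGVB loss earlier:
    #   loss = N * mean_nll - neg_kl_approx(layer.log_alpha()).sum()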
Sparse VD: Sparsity
• Interpreting α as a dropout rate: α → ∞ means the drop probability goes to 1 (retain probability → 0), so the weight is always dropped and can simply be removed.
• Interpreting α as the variance of the multiplicative noise on w: α → ∞ means infinitely large noise, the weight's value becomes pure randomness and carries no information, so the corresponding θ is driven to 0.
Sparse VD: For convolution layers
• Sparse VD for FC layers (LRT form):
  b_{mj} ~ N(γ_{mj}, δ_{mj}),
  γ_{mj} = Σ_{i=1}^{I} a_{mi} θ_{ij},
  δ_{mj} = Σ_{i=1}^{I} a²_{mi} σ²_{ij} = Σ_{i=1}^{I} a²_{mi} α_{ij} θ²_{ij}   (using σ² = α θ², by the additive reparameterization)
• Sparse VD for conv layers: the same construction, applied per output location; convolve the activations with θ to get the output means, and the squared activations with σ² (= α θ²) to get the output variances.
Sparse VD: Empirical Observations
• At test time the weight distributions are not kept: weights whose log α exceeds a threshold are pruned, and only the remaining means θ are used (a sketch follows).
• If the KL term dominates the expected log-likelihood term too early, the network is pushed toward sparsity before it has fit the data, which hurts accuracy.
• In practice this is handled by pretraining the network or by scaling (warming up) the KL term.
• The prior's pull toward sparsity is strong, so when training from scratch the variances are hard to fit well at first.
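The test-time pruning is typically a one-line mask on log α; a sketch using the LinearSparseVD layer from the earlier example (the threshold log α = 3 is the one used for the sparsity numbers in the paper):

    import torch
    import torch.nn.functional as F

    def pruned_forward(layer, x, thresh=3.0):
        # Drop every weight whose log alpha exceeds the threshold
        # (log alpha = 3 means alpha ~ 20: an almost-always-dropped weight).
        keep = (layer.log_alpha() < thresh).float()
        return F.linear(x, layer.theta * keep)

    def sparsity(layer, thresh=3.0):
        keep = (layer.log_alpha() < thresh).float()
        return 1.0 - keep.mean().item()  # fraction of pruned weights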
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
Implementation
• Authors' implementation (Theano, Lasagne):
  https://github.com/senya-ashukha/variational-dropout-sparsifies-dnn
• Reimplementation by Google AI research (TensorFlow, compares several sparsification methods):
  https://github.com/google-research/google-research/tree/master/state_of_sparsity
• Other repositories (TensorFlow):
  https://github.com/cjratcliff/variational-dropout (in progress)
  https://github.com/BayesWatch/tf-variational-dropout (incomplete)
Any questions?