Tutorial:
Sparse Variational Dropout
Wu Hyun Shin
MLAI, KAIST
July 24, 2019
Before We Start
• This tutorial builds up to (Sparse) Variational Dropout step by step: Bayesian neural networks / variational inference + the familiar dropout layer.
• The BNN / variational-inference background is covered only as far as the dropout story needs it, so some details are simplified.
• Notation loosely follows the original papers; questions and corrections are welcome at any point.
Papers Covered
Binary Dropout (BD)
• Improving neural networks by preventing co-adaptation of feature detectors. Hinton et al. arXiv:1207.0580, 2012. (4002 citations)
• Dropout: a simple way to prevent neural networks from overfitting. Srivastava et al. JMLR 2014. (13126 citations)
Gaussian Dropout (GD)
• Fast dropout training. Wang et al. ICML 2013. (249 citations)
Variational Dropout (VD)
• Variational Dropout and the Local Reparameterization Trick. Kingma et al. NIPS 2015. (326 citations)
Sparse Variational Dropout (Sparse VD) ← the final goal!
• Variational Dropout Sparsifies Deep Neural Networks. Molchanov et al. ICML 2017. (148 citations)
→ The earlier papers provide the building blocks; the final paper combines and extends them.
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
From BD to GD
• What does (binary) dropout do?
  • retain each unit with probability p, drop it with probability 1-p;
  • equivalently, multiply each unit by a Bernoulli random variable ξ_i ∈ {0, 1}.
• Multiplicative Bernoulli noise:
  x̂_i = x_i · ξ_i,  ξ_i ~ Bernoulli(p)
• This is Binary Dropout.
From BD to GD
• Training time vs. test time.
• In common implementations (e.g. PyTorch), the retained units are rescaled by 1/p during training ("inverted dropout"), so nothing has to change at test time.
• The multiplicative noise then becomes
  ξ_i = 1/p with probability p,  ξ_i = 0 with probability 1-p,
  applied to each unit (and hence to the weights w_i it multiplies).
From BD to GD
• Let's compute the mean and variance of this scaled Bernoulli noise ξ_i.
• Mean:
  E[ξ_i] = (1/p) · Pr(ξ_i = 1/p) + 0 · Pr(ξ_i = 0) = (1/p) · p + 0 · (1-p) = 1
• Second moment:
  E[ξ_i²] = (1/p)² · p + 0² · (1-p) = 1/p
• Variance:
  Var[ξ_i] = E[ξ_i²] - (E[ξ_i])² = 1/p - 1² = (1-p)/p
From BD to GD
• Now replace the Bernoulli noise with a Gaussian random variable that matches its mean and variance:
  μ = 1,  σ² = (1-p)/p,  so  ξ_i ~ N(μ, σ²) = N(1, (1-p)/p)
• Writing α = (1-p)/p, this becomes ξ_i ~ N(1, α): the first two moments of binary dropout are preserved!
• Multiplicative Gaussian noise (a minimal code sketch follows):
  x̂_i = x_i · ξ_i,  ξ_i ~ N(1, α),  α = (1-p)/p
• This is Gaussian Dropout.
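To make the two noise models concrete, here is a minimal PyTorch-style sketch (my own illustration, not code from the slides or papers); the module names and the convention that p is the retain probability are assumptions carried over from the text above.

    import torch
    import torch.nn as nn

    class BinaryDropout(nn.Module):
        """Multiplicative Bernoulli noise: keep a unit with prob p, scale kept units by 1/p."""
        def __init__(self, p=0.5):
            super().__init__()
            self.p = p  # retain probability, as in the slides

        def forward(self, x):
            if not self.training:
                return x  # inverted dropout: no rescaling needed at test time
            mask = torch.bernoulli(torch.full_like(x, self.p))
            return x * mask / self.p

    class GaussianDropout(nn.Module):
        """Multiplicative Gaussian noise xi ~ N(1, alpha) with alpha = (1 - p) / p."""
        def __init__(self, p=0.5):
            super().__init__()
            self.alpha = (1.0 - p) / p

        def forward(self, x):
            if not self.training:
                return x  # the noise has mean 1, so the expected output is unchanged
            noise = 1.0 + self.alpha ** 0.5 * torch.randn_like(x)
            return x * noise

Both layers leave the expected activation unchanged; they differ only in the shape of the multiplicative noise.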
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
Recap: Bayesian Neural Networks
• What is a BNN?
  • Each weight is treated as a random variable with a distribution, not a point estimate.
• How is it trained?
  • Through Bayes' theorem:
    posterior = likelihood × prior / evidence,  i.e.  p(w|D) = p(D|w) · p(w) / p(D)
  • The posterior is what we are after: it captures the uncertainty over the weights.
• The problem:
  • The evidence p(D) is intractable for neural networks, so the exact posterior cannot be computed.
• A practical alternative: Variational Inference.
Recap: Variational Inference
• What is Variational Inference?
  • A way to approximate the intractable posterior p(w|D).
• The idea:
  • Choose a tractable family of distributions over the weights: q_φ(w).
  • Make q_φ(w) as close to p(w|D) as possible.
• How is closeness measured?
  • With the KL divergence, optimized over the variational parameters:
    φ* = argmin_φ KL[ q_φ(w) || p(w|D) ]
• Inference becomes an optimization problem.
(q_φ(w): variational distribution / φ: variational parameter / p(w|D): posterior distribution)
Recap: Variational Inference
• The objective can be rewritten as
  KL[ q_φ(w) || p(w|D) ] + L(φ) = log p(D)
• Since log p(D) is a constant, minimizing the KL is equivalent to maximizing L(φ).
• What is L(φ)?
  L(φ) = E_{q_φ(w)}[ log p(D|w) ] - KL[ q_φ(w) || p(w) ]
• Two terms: expected log-likelihood + KL regularization toward the prior.
• Since L(φ) ≤ log p(D), it is called the ELBO (Evidence Lower BOund).
• New objective: maximize the ELBO.
Recap: Variational Inference
• Let's make this concrete with a simple example.
• Setting: a classification problem, a fully-connected network with a single layer.
• Prior: before seeing any data, assume each weight follows a zero-mean Gaussian N(0, ·).
• Variational posterior: model each weight as a Gaussian and fit its parameters to the training data.
• During training, each weight distribution should
  • stay close to the prior (min KL term), and
  • explain the observed data well (min NLL term).
Recap: Variational Inference
[Diagram: a fully-connected layer computing B = A · W, with input activations A, weight matrix W (I × O), and outputs B.]
• The variational posterior q_φ(w) for each weight:
  • modeled as a Gaussian: w ~ N(μ, σ²);
  • μ and σ are the learnable parameters (trained by back-propagation).
• Remaining question: how do we back-propagate through the sampling of w?
Recap: Variational Inference
[Diagram: the same layer, now with the weights sampled as W = μ + σ ⊙ ε.]
• The variational posterior for each weight is a Gaussian w ~ N(μ, σ²) with learnable μ, σ.
• How do we back-propagate through the sampling? The Reparameterization Trick (RT):
  w ~ q_φ(w) = N(μ, σ²)
  ⇒ w = f(φ, ε),  ε ~ p(ε)
  ⇒ w = μ + σ · ε,  ε ~ N(0, 1)
• The randomness is moved into ε, so gradients can flow to μ and σ.
Recap: Variational Inference
[Diagram: the same layer with W = μ + σ ⊙ ε, ε ~ N(0, 1).]
• How do we actually train this?
• For deep networks the ELBO has to be optimized with minibatch-based training.
• With φ = {μ, σ}:
  argmax_φ  E_{q_φ(w)}[ log p(D|w) ] - KL[ q_φ(w) || p(w) ]
  ≈ argmax_φ  (N/M) Σ_{m=1}^{M} log p(y_m | x_m, w = f(φ, ε_m)) - KL[ q_φ(w) || p(w) ]
• First term: a minibatch-based Monte Carlo approximation, sampled through the RT.
• Second term: usually computed analytically (closed form).
• This estimator is called Stochastic Gradient Variational Bayes (SGVB); a minimal sketch follows.
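A minimal sketch (my own, not code from the papers) of how this objective is typically assembled for a single Bayesian linear layer with a fully-factorized Gaussian posterior N(μ, σ²) and a standard-normal prior; the class and function names are illustrative, and the closed-form KL used here is the standard Gaussian-to-Gaussian expression.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BayesLinear(nn.Module):
        def __init__(self, n_in, n_out):
            super().__init__()
            self.mu = nn.Parameter(0.01 * torch.randn(n_out, n_in))
            self.log_sigma = nn.Parameter(torch.full((n_out, n_in), -3.0))

        def forward(self, x):
            # Reparameterization trick: w = mu + sigma * eps, eps ~ N(0, 1)
            sigma = self.log_sigma.exp()
            w = self.mu + sigma * torch.randn_like(sigma)
            return F.linear(x, w)

        def kl(self):
            # KL[ N(mu, sigma^2) || N(0, 1) ], summed over all weights
            sigma2 = (2 * self.log_sigma).exp()
            return 0.5 * (sigma2 + self.mu ** 2 - 1 - 2 * self.log_sigma).sum()

    def sgvb_loss(layer, x, y, dataset_size):
        # Negative ELBO on a minibatch: -(N/M) * sum_m log p(y_m | x_m, w) + KL
        logits = layer(x)
        nll = F.cross_entropy(logits, y, reduction="mean")  # (1/M) * sum_m -log p(y_m | ...)
        return dataset_size * nll + layer.kl()

Minimizing sgvb_loss is the same as maximizing the minibatch ELBO above; one weight sample per step is drawn inside forward().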
Recap: Variational Inference
[Diagram: the same layer with W = μ + σ ⊙ ε.]
• Reading the objective:
  argmax_φ  (N/M) Σ_{m=1}^{M} log p(y_m | x_m, w = f(φ, ε_m)) - KL[ q_φ(w) || p(w) ]
• First term: essentially the usual non-Bayesian training loss,
  • except that the weights now carry randomness (they are sampled from q_φ).
• Second term: the KL divergence from the prior N(0, ·),
  • it regularizes the weight distributions so they do not drift too far from the prior.
Recap: Variational Inference
• One more thing before moving on.
• Gradient variance in SGVB:
  • because of the randomness, the gradient estimates themselves have variance;
  • sources: the data distribution p(D) (minibatch sampling) and the noise distribution p(ε);
  • if this variance is too large, training becomes slow and unstable.
• The second term (the KL term) is usually not a problem: it can often be computed in closed form.
• The first term relies on sampling, so it is the main source of gradient variance.
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
VD: Variational Dropout
• Two main contributions:
  • Part 1: a way to make SGVB more efficient and scalable, the Local Reparameterization Trick (LRT), which reduces gradient variance and speeds up training.
  • Part 2: a reinterpretation of dropout as a variational method: GD + variational method + LRT = Variational Dropout.
    • Advantage: essentially the same computational cost as GD (with the LRT).
    • Advantage: the dropout rate becomes a learnable parameter.
    • Open question: which prior turns GD into a proper Bayesian network?
VD-Part 1: Local Reparameterization Trick
• First, the Local Reparameterization Trick (LRT).
• Goal: make SGVB more efficient by reducing its gradient variance.
• Where does the gradient variance come from? Let's start from a decomposition of the variance.
VD-Part 1: Local Reparameterization Trick
• Recall SGVB.
• ELBO:  E_{q_φ(w)}[ Σ_{(x,y)∈D} log p(y|x, w) ] - KL[ q_φ(w) || p(w) ]
• Assume the KL term can be handled in closed form, and focus on the first term.
• Minibatch approximation:
  E_{q_φ(w)}[ Σ_{(x,y)∈D} log p(y|x, w) ] ≈ (N/M) Σ_{m=1}^{M} log p(y_m | x_m, w = f(φ, ε_m)) =: (N/M) Σ_{m=1}^{M} L_m
• So the SGVB estimator is (N/M) Σ_{m=1}^{M} L_m, where L_m is the sampled log-likelihood of one minibatch example.
(M: minibatch size / N: dataset size)
VD-Part 1: Local Reparameterization Trick
• What is the variance of the estimator (N/M) Σ_{m=1}^{M} L_m?
• It splits into a variance part and a covariance part (see the decomposition below).
• The variance part shrinks as the minibatch size M grows.
• The covariance part, however, does not go away with a larger minibatch.
• The condition we want is therefore
  Cov[ L_i, L_j ] = 0  for i ≠ j.
• In words: within a minibatch, the sampled log-likelihoods of different examples should be uncorrelated.
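For reference, the decomposition the slide points to can be written out as follows (assuming the L_m within a minibatch are identically distributed; this restatement is mine):

    \operatorname{Var}\!\Big[\frac{N}{M}\sum_{m=1}^{M} L_m\Big]
      = \frac{N^2}{M^2}\Big(\sum_{m}\operatorname{Var}[L_m]
          + \sum_{m \neq m'}\operatorname{Cov}[L_m, L_{m'}]\Big)
      = N^2\Big(\frac{1}{M}\operatorname{Var}[L_m]
          + \frac{M-1}{M}\operatorname{Cov}[L_m, L_{m'}]\Big)

The first term decays as 1/M, while the covariance term approaches a constant as M grows, which is why Cov[L_m, L_{m'}] = 0 is the property to aim for.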
VD-Part 1: Local Reparameterization Trick
• When is this condition violated?
[Diagram: B = A · W with a single sampled weight matrix W = μ + σ ⊙ ε shared by the whole minibatch.]
• Naive approach: draw one weight-matrix sample per minibatch (one ε per weight).
• Every example in the minibatch then flows through the same sampled weights, so every output depends on the same ε ~ N(0, 1).
• Because the examples share the noise, their log-likelihoods become dependent:
  Cov[ L_i, L_j ] ≠ 0.
VD-Part 1: Local Reparameterization Trick
• How can the condition be satisfied?
[Diagram: B = A · W with a separate sampled weight matrix per example.]
• Straightforward fix: draw a separate weight-matrix sample for every example in the minibatch.
• Each output row then depends only on its own noise, so the dependency between examples disappears:
  Cov[ L_i, L_j ] = 0.
• Problem: this is far too expensive (M full weight samples per layer, per step), in both memory and computation.
• Is there a smarter way?
VD-Part 1: Local Reparameterization Trick
• Key observation: look at a single entry of the output,
  b_{m,j} = Σ_i a_{m,i} · w_{i,j}.
• If the w_{i,j} are Gaussian, then b_{m,j} is also Gaussian:
  if X and Y are independent and normally distributed, X + Y is also normally distributed.
(Notation: a_{m,i} are the input activations, w_{i,j} the weights, b_{m,j} the outputs; m indexes minibatch examples.)
VD-Part 1: Local Reparameterization Trick
• So we can sample B directly in the output space, without ever sampling the weights. This is the LRT!
• With q(w_{i,j}) = N(μ_{i,j}, σ²_{i,j}):
  b_{m,j} = γ_{m,j} + sqrt(δ_{m,j}) · ζ_{m,j},  ζ_{m,j} ~ N(0, 1),
  where  γ_{m,j} = Σ_i a_{m,i} μ_{i,j}  and  δ_{m,j} = Σ_i a²_{m,i} σ²_{i,j}
  (squares and square roots are taken element-wise).
• Weight noise has been turned into activation noise; global noise has become local noise.
VD-Part 1: Local Reparameterization Trick
• What do we gain? (a code sketch follows)
• Each example now gets its own noise ζ, so Cov[ L_i, L_j ] = 0 and the gradient variance drops!
• Faster convergence in terms of optimization steps: lower-variance gradients from a single sample per step.
• Faster in terms of wall-clock time as well: we draw one noise value per activation instead of per weight (and per example).
[Diagram: global noise = weight noise vs. local noise = activation/unit noise.]
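As a concrete illustration (my own sketch, not the authors' code), here is the difference between naive weight-noise sampling and the LRT for a fully-connected layer with posterior N(μ, σ²) over the weights; the parameter names and shapes are assumptions.

    import torch
    import torch.nn.functional as F

    def forward_weight_noise(a, mu, log_sigma):
        # Naive SGVB: one sampled weight matrix shared by the whole minibatch.
        sigma = log_sigma.exp()
        w = mu + sigma * torch.randn_like(sigma)
        return F.linear(a, w)

    def forward_lrt(a, mu, log_sigma):
        # Local Reparameterization Trick: sample the pre-activations directly,
        #   b_{m,j} ~ N( sum_i a_{m,i} mu_{i,j} ,  sum_i a_{m,i}^2 sigma_{i,j}^2 )
        gamma = F.linear(a, mu)                          # output means
        delta = F.linear(a ** 2, (2 * log_sigma).exp())  # output variances
        return gamma + delta.clamp(min=1e-8).sqrt() * torch.randn_like(gamma)

Both functions sample from the same per-entry distribution of b, but forward_lrt gives every row of the minibatch its own noise while drawing only a (batch × output) noise tensor.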
VD-Part 2
• So far: a trick that reduces the gradient variance of SGVB, the LRT.
• Next: reinterpret dropout as a variational method.
• Result: Variational Dropout (with the LRT).
VD-Part 2: Reinterpretation of GD as VD
• Dropout as a variational method: compare the two views of the same layer (the moment matching is spelled out below).

Gaussian dropout
• Multiplicative noise on the units: B = (A ∘ ξ) θ,  ξ_{m,i} ~ N(1, α)
• With the LRT:
  b_{m,j} = Σ_i (a_{m,i} ξ_{m,i}) θ_{i,j}
  E[b_{m,j}] = Σ_i a_{m,i} θ_{i,j}
  Var[b_{m,j}] = α Σ_i a²_{m,i} θ²_{i,j}

Variational Bayesian inference
• Noise on the weights: B = A W,  w_{i,j} ~ N(θ_{i,j}, α θ²_{i,j})  (mean θ, multiplicative noise α)
• With the LRT:
  b_{m,j} = Σ_i a_{m,i} w_{i,j}
  E[b_{m,j}] = Σ_i a_{m,i} θ_{i,j}
  Var[b_{m,j}] = α Σ_i a²_{m,i} θ²_{i,j}

• If the posterior is chosen as w_{i,j} ~ N(θ_{i,j}, α θ²_{i,j}), the two views give the same distribution over B (details in Appendix B of the paper).
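The moment matching behind this comparison can be written out per output entry (my restatement of the argument):

    \xi_{m,i} \sim \mathcal{N}(1, \alpha)
      \;\Rightarrow\; \theta_{i,j}\,\xi_{m,i} \sim \mathcal{N}\big(\theta_{i,j},\, \alpha\,\theta_{i,j}^{2}\big),
    \qquad
    b_{m,j} = \sum_i a_{m,i}\,\theta_{i,j}\,\xi_{m,i}
      \;\sim\; \mathcal{N}\Big(\sum_i a_{m,i}\theta_{i,j},\;\; \alpha \sum_i a_{m,i}^{2}\theta_{i,j}^{2}\Big)

which is exactly the marginal obtained by drawing w_{i,j} ~ N(θ_{i,j}, α θ_{i,j}²) independently and computing b_{m,j} = Σ_i a_{m,i} w_{i,j}.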
VD-Part 2: Reinterpretation of GD as VD
• Training Gaussian dropout as a variational method gives Variational Dropout (almost).
• Advantages:
  • with the LRT, the computational cost is essentially the same as Gaussian dropout;
  • the dropout rate α becomes a variational parameter and can be learned:
    min_φ KL[ q_φ(w) || p(w|D) ],  with φ = {θ, α}.
• One question remains: what should the prior be?
• (Side note: binary dropout → Gaussian dropout → Variational Dropout; the first step, approximating binary dropout by Gaussian dropout, rests on the central limit theorem. See: Fast dropout training. Wang et al. ICML 2013.)
VD-Part 2: Reinterpretation of GD as VD
• What should the prior be?
• Requirement: consistency with Gaussian dropout (this is the subtle part).
  • In Gaussian dropout, the dropout rate α is fixed and only the weights θ are trained, using just the expected log-likelihood term.
  • With w ~ N(θ, α θ²), the variational objective is
    max_θ  E_{q(w|θ,α)}[ Σ_{(x,y)} log p(y|x, w) ] - KL[ q(w|θ, α) || p(w) ]
  • So when α is fixed, the KL term has to be independent of θ (no effect on the optimization of θ).
• The prior that satisfies this requirement: the log-uniform prior.
VD-Part 2: Reinterpretation of GD as VD
• The shape of the log-uniform distribution: p(log |w|) ∝ const, i.e. p(|w|) ∝ 1/|w|.
[Plot: density of the log-uniform prior, concentrated near zero on the |w| axis.]
• Most of the density sits near zero, so the prior pushes weights toward zero: it encourages sparsity.
• *Interpretation via MDL (Minimum Description Length): if weights are stored in a floating-point format, a log-uniform prior roughly corresponds to paying a cost per significant digit rather than per unit of magnitude, so cheaply describable weights are preferred. (Roughly speaking.)
VD-Part 2: Reinterpretation of GD as VD
• Can the negative KL term be computed in closed form?
  max_φ  E_{q_φ(w)}[ Σ log p(y|x, w) ] - KL[ q_φ(w) || p(w) ]
• Following Appendix C of the paper, with the log-uniform prior the KL term depends only on α:
  • it is independent of θ, exactly as required;
  • but it is analytically intractable, so it has to be approximated.
VD-Part 2: Reinterpretation of GD as VD
• The intractable KL term is therefore approximated in the original VD paper:
  • (1) a third-order polynomial approximation in α;
  • (2) an alternative lower bound;
  • both are accurate as α → 0.
• Limitation: the approximation is only used for α ≤ 1, i.e. retain probability p ≥ 0.5 (α = (1-p)/p).
• For larger α, training also suffers from large gradient variance and tends to end up in poor local minima.
• Note: log α = 0 (α = 1), where the KL equals 0 up to the constant C, already corresponds to dropping half of the units.
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
Sparse VD
• What does Sparse VD change and add relative to VD?
• Per-weight parameters: each weight gets its own θ_{ij} and α_{ij} (an individual dropout rate per weight).
• (1) Additive Noise Reparameterization:
  • a further trick to reduce the gradient variance.
• (2) A new approximation of the KL divergence:
  • accurate over the whole range of α (e.g. α ≥ 1 as well).
• α → ∞: the weight is effectively dropped / small α: the weight is kept.
• Net effect:
  • training drives many α to infinity, yielding a very sparse network;
  • this can be viewed as Bayesian pruning.
Sparse VD: Additive Noise Reparameterization
• The remaining problem in VD:
  • when the dropout rate α is large, the gradient with respect to θ has large variance, because the multiplicative noise appears in it.
• The solution: a change of variables.
  • Rewrite the multiplicative parameterization additively:
    w_{ij} = θ_{ij} (1 + sqrt(α_{ij}) ε_{ij}) = θ_{ij} + σ_{ij} ε_{ij},  ε_{ij} ~ N(0, 1),
    where σ²_{ij} = α_{ij} θ²_{ij} is treated as a new, independent variable (optimized as log σ, i.e. w = θ + exp(log σ) · ε).
  • The distribution over w, and hence over the layer output, is exactly the same.
  • But the gradient changes: ∂w_{ij}/∂θ_{ij} = 1 + sqrt(α_{ij}) ε_{ij} becomes ∂w_{ij}/∂θ_{ij} = 1; the noise no longer multiplies θ.
• Same model, greatly reduced gradient variance! (a layer sketch follows)
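A sketch of what this parameterization looks like in a layer (my own illustration; the real implementations linked at the end do the same with extra numerical safeguards). The layer stores θ and log σ, recovers log α only when needed, and samples the output with the LRT in the additive form.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LinearSparseVD(nn.Module):
        def __init__(self, n_in, n_out):
            super().__init__()
            self.theta = nn.Parameter(0.01 * torch.randn(n_out, n_in))
            self.log_sigma = nn.Parameter(torch.full((n_out, n_in), -5.0))

        def log_alpha(self):
            # alpha = sigma^2 / theta^2  (the additive reparameterization in reverse)
            return 2 * self.log_sigma - 2 * torch.log(self.theta.abs() + 1e-8)

        def forward(self, x):
            if not self.training:
                return F.linear(x, self.theta)
            # LRT with w = theta + sigma * eps: sample the outputs directly
            mean = F.linear(x, self.theta)
            var = F.linear(x ** 2, (2 * self.log_sigma).exp())
            return mean + var.clamp(min=1e-8).sqrt() * torch.randn_like(mean)

In the sampled forward pass, θ only enters through the output mean, so its gradient does not carry the noise term.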
Sparse VD: Approximation of the KL term
• The KL term (which depends only on α) still has to be approximated, now over the full range of α.
• The paper proposes an accurate heuristic approximation (a code sketch follows):
  -KL ≈ k1 · sigmoid(k2 + k3 · log α) - 0.5 · log(1 + α⁻¹) + C
  • the -0.5 · log(1 + α⁻¹) term captures the exact asymptotic behaviour;
  • the remaining part is fitted with a scaled and shifted sigmoid to sampled values of the true KL.
• For comparison, the approximations used in the original VD paper (valid only for small α):
  approx. 1: 0.5 log α + c1 α + c2 α² + c3 α³
  approx. 2: 0.5 log α
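In practice this approximation is implemented directly from the constants reported in Molchanov et al. (2017); a sketch that pairs with the layer above (the function name is mine, the constants are the paper's):

    import torch

    def neg_kl_approx(log_alpha):
        # -KL[ q(w | theta, alpha) || log-uniform prior ], approximated per weight:
        #   k1 * sigmoid(k2 + k3 * log_alpha) - 0.5 * log(1 + 1/alpha) + C
        k1, k2, k3 = 0.63576, 1.87320, 1.48695
        C = -k1
        return (k1 * torch.sigmoid(k2 + k3 * log_alpha)
                - 0.5 * torch.log1p(torch.exp(-log_alpha)) + C)

    # Minibatch objective (N = dataset size, M = minibatch size), cf. the SGVB loss earlier:
    #   loss = N * mean_nll - neg_kl_approx(layer.log_alpha()).sum()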
Sparse VD: Sparsity
• Interpreting α as a dropout rate: α → ∞ means the drop probability goes to 1 (retain probability → 0), so the weight is always dropped and can simply be removed.
• Interpreting α as the variance of the multiplicative noise on w: α → ∞ means infinitely large noise, the weight's value becomes pure randomness and carries no information, so the corresponding θ is driven to 0.
Sparse VD: For convolution layers
• Sparse VD for FC layers (LRT form):
  b_{mj} ~ N(γ_{mj}, δ_{mj}),
  γ_{mj} = Σ_{i=1}^{I} a_{mi} θ_{ij},
  δ_{mj} = Σ_{i=1}^{I} a²_{mi} σ²_{ij} = Σ_{i=1}^{I} a²_{mi} α_{ij} θ²_{ij}   (using σ² = α θ², by the additive reparameterization)
• Sparse VD for conv layers: the same construction, applied per output location; convolve the activations with θ to get the output means, and the squared activations with σ² (= α θ²) to get the output variances.
Sparse VD: Empirical Observations
• At test time the weight distributions are not kept: weights whose log α exceeds a threshold are pruned, and only the remaining means θ are used (a sketch follows).
• If the KL term dominates the expected log-likelihood term too early, the network is pushed toward sparsity before it has fit the data, which hurts accuracy.
• In practice this is handled by pretraining the network or by scaling (warming up) the KL term.
• The prior's pull toward sparsity is strong, so when training from scratch the variances are hard to fit well at first.
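The test-time pruning is typically a one-line mask on log α; a sketch using the LinearSparseVD layer from the earlier example (the threshold log α = 3 is the one used for the sparsity numbers in the paper):

    import torch
    import torch.nn.functional as F

    def pruned_forward(layer, x, thresh=3.0):
        # Drop every weight whose log alpha exceeds the threshold
        # (log alpha = 3 means alpha ~ 20: an almost-always-dropped weight).
        keep = (layer.log_alpha() < thresh).float()
        return F.linear(x, layer.theta * keep)

    def sparsity(layer, thresh=3.0):
        keep = (layer.log_alpha() < thresh).float()
        return 1.0 - keep.mean().item()  # fraction of pruned weights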
Big Picture
Binary dropout → Gaussian dropout → (+ Bayesian NN / Variational Inference) → Variational dropout → Sparse Variational dropout
Implementation
• Authors' implementation (Theano, Lasagne):
  https://github.com/senya-ashukha/variational-dropout-sparsifies-dnn
• Reimplementation by Google AI research (TensorFlow, compares several sparsification methods):
  https://github.com/google-research/google-research/tree/master/state_of_sparsity
• Other repositories (TensorFlow):
  https://github.com/cjratcliff/variational-dropout (in progress)
  https://github.com/BayesWatch/tf-variational-dropout (incomplete)
Any questions?