Auto-Encoders and Variational Auto-Encoders
In this chapter, we will discuss
• What an autoencoder is
- a neural network whose input and output have the same dimension
- if the autoencoder uses only linear activations and the cost
function is MSE, it is equivalent to PCA
- the architecture of a stacked autoencoder is typically symmetrical
Contents
• Autoencoders
• Denoising Autoencoders
• Variational Autoencoders
Auto-Encoders
Introduction
• An autoencoder is split into an encoder and a decoder.
In the diagram on the right, $f$ is the encoder and
$g$ is the decoder.
• The cost function is the reconstruction error:
$L(x, g(f(x))) = \| x - g(f(x)) \|^2$
• An AE with two or more hidden layers is called a stacked autoencoder.
• e.g. an autoencoder that uses only linear activations and MSE as the
cost function is PCA.
π‘₯
β„Ž
π‘Ÿ
𝑓 𝑔
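As an illustration, a minimal sketch of such an encoder/decoder pair in PyTorch; the layer sizes and the 784-dimensional (MNIST-style) input are assumptions for the example, not taken from the slides:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, d_in=784, d_code=32):
        super().__init__()
        # encoder f: x -> h and decoder g: h -> r, symmetric around the coding layer
        self.f = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(), nn.Linear(128, d_code))
        self.g = nn.Sequential(nn.Linear(d_code, 128), nn.ReLU(), nn.Linear(128, d_in))

    def forward(self, x):
        h = self.f(x)   # code h
        r = self.g(h)   # reconstruction r
        return r

model = AutoEncoder()
x = torch.rand(16, 784)                          # a dummy batch
loss = ((x - model(x)) ** 2).sum(dim=1).mean()   # reconstruction error L(x, g(f(x)))
loss.backward()
```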
Introduction
• The number of hidden layers is usually odd, and by the definition of a stacked AE
the unit counts on either side of the center layer (the coding layer) are symmetric.
[Diagram: Encoder → Coding layer → Decoder]
Introduction
βˆ™ μ•„λž˜μ˜ 그림처럼 symmetric ν•˜μ§€ μ•ŠλŠ” Autoencoder 도 μžˆλ‹€.
βˆ™ 보톡은 encoder λΆ€λΆ„μ˜ λ§ˆμ§€λ§‰ layer 의 unit 의 κ°œμˆ˜λŠ” input layer의
dimension 보닀 μž‘λ‹€. 이런 Autoencoder λ₯Ό undercomplete 라고 ν•˜κ³ 
λ°˜λŒ€μ˜ 경우λ₯Ό overcomplete 라고 ν•œλ‹€.
* overcomplete 이면 inclusion + projection 으둜 loss λ₯Ό 0 으둜 λ§Œλ“€ 수 있음
Cost function
• [VLBM, p.2] An autoencoder takes an input vector $x \in [0,1]^d$, and first
maps it to a hidden representation $y \in [0,1]^{d'}$ through a deterministic
mapping $y = f_\theta(x) = s(Wx + b)$, parameterized by $\theta = \{W, b\}$.
$W$ is a $d' \times d$ weight matrix and $b$ is a bias vector. The resulting latent
representation $y$ is then mapped back to a "reconstructed" vector
$z \in [0,1]^d$ in input space, $z = g_{\theta'}(y) = s(W'y + b')$ with $\theta' = \{W', b'\}$.
The weight matrix $W'$ of the reverse mapping may optionally be
constrained by $W' = W^T$, in which case the autoencoder is said to
have tied weights.
• The parameters of this model are optimized to minimize
the average reconstruction error:
$$\theta^*, \theta'^* = \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\big(x^{(i)}, z^{(i)}\big)
= \arg\min_{\theta, \theta'} \frac{1}{n} \sum_{i=1}^{n} L\big(x^{(i)}, g_{\theta'}(f_\theta(x^{(i)}))\big) \quad (1)$$
where $L(x, z) = \| x - z \|^2$.
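A hedged sketch of this formulation with tied weights ($W' = W^T$); the dimensions $d$, $d'$ and the sigmoid choice for $s$ are placeholders:

```python
import torch
import torch.nn.functional as F

d, d_prime = 784, 256                                     # assumed input / hidden dimensions
W  = (0.01 * torch.randn(d_prime, d)).requires_grad_()    # d' x d matrix, shared by both mappings
b  = torch.zeros(d_prime, requires_grad=True)
b_ = torch.zeros(d, requires_grad=True)

def f_theta(x):                 # y = s(Wx + b)
    return torch.sigmoid(F.linear(x, W, b))

def g_theta_prime(y):           # z = s(W'y + b') with tied weights W' = W^T
    return torch.sigmoid(F.linear(y, W.t(), b_))

x = torch.rand(8, d)
z = g_theta_prime(f_theta(x))
loss = ((x - z) ** 2).sum(dim=1).mean()   # average reconstruction error, as in Eq. (1)
loss.backward()
```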
Cost function
• [VLBM,p.2] An alternative loss, suggested by the interpretation of $x$
and $z$ as either bit vectors or vectors of bit probabilities (Bernoullis), is
the reconstruction cross-entropy:
$$L_H(x, z) = \mathbb{H}(\mathcal{B}_x \| \mathcal{B}_z) = -\sum_{k=1}^{d} \big[ x_k \log z_k + (1 - x_k) \log(1 - z_k) \big]$$
where $\mathcal{B}_\mu(x) = (\mathcal{B}_{\mu_1}(x), \cdots, \mathcal{B}_{\mu_d}(x))$ is a Bernoulli distribution.
• [VLBM,p.2] Equation (1) with $L = L_H$ can be written
$$\theta^*, \theta'^* = \arg\min_{\theta, \theta'} \mathbb{E}_{q^0(X)}\big[ L_H\big(X, g_{\theta'}(f_\theta(X))\big) \big]$$
where $q^0(X)$ denotes the empirical distribution associated to our $n$
training inputs.
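A minimal sketch of the reconstruction cross-entropy $L_H$, assuming $x$ and $z$ hold values/probabilities in $(0,1)$ (the clamping constant is an implementation detail, not from the slides):

```python
import torch

def reconstruction_cross_entropy(x, z, eps=1e-7):
    """L_H(x, z) = -sum_k [ x_k log z_k + (1 - x_k) log(1 - z_k) ], summed over dimensions."""
    z = z.clamp(eps, 1 - eps)   # avoid log(0)
    return -(x * torch.log(z) + (1 - x) * torch.log(1 - z)).sum(dim=-1)

x = torch.rand(4, 784)
z = torch.rand(4, 784)
print(reconstruction_cross_entropy(x, z).mean())   # average over the batch, as in Eq. (1)
```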
Denoising Auto-Encoders
Denoising Autoencoders
• [VLBM,p.3] The network is trained by feeding in a corrupted input and recovering the
repaired input. More precisely, with input dimension $d$, a 'desired proportion $\nu$ of
destruction' is chosen and $\nu d$ components of the input are destroyed by setting them
to 0; the $\nu d$ components are chosen at random. A Denoising Autoencoder is an
autoencoder trained with a cost function that measures the reconstruction error against
the input before destruction, i.e. it learns to recover the original input. The process of
obtaining the destroyed version $\tilde{x}$ from $x$ follows a stochastic
mapping $\tilde{x} \sim q_D(\tilde{x}|x)$.
Denoising Autoencoders
• [VLBM,p.3] Let us define the joint distribution
$$q^0(X, \tilde{X}, Y) = q^0(X)\, q_D(\tilde{X}|X)\, \delta_{f_\theta(\tilde{X})}(Y)$$
where $\delta_u(v)$ is the Kronecker delta. Thus $Y$ is a deterministic
function of $\tilde{X}$. $q^0(X, \tilde{X}, Y)$ is parameterized by $\theta$. The objective
function minimized by stochastic gradient descent becomes:
$$\arg\min_{\theta, \theta'} \mathbb{E}_{q^0(X, \tilde{X})}\big[ L_H\big(X, g_{\theta'}(f_\theta(\tilde{X}))\big) \big]$$
Training feeds in the corrupted $\tilde{x}$ and learns to recover $x$!
Other Autoencoders
• [DL book, 14.2.1] A sparse autoencoder is simply an autoencoder
whose training criterion involves a sparsity penalty $\Omega(h)$ on the
code layer $h$, in addition to the reconstruction error:
$$L(x, g(f(x))) + \Omega(h),$$
where $g(h)$ is the decoder output and typically we have $h = f(x)$,
the encoder output.
• [DL book, 14.2.3] Another strategy for regularizing an autoencoder
is to use a penalty $\Omega$ as in sparse autoencoders,
$$L(x, g(f(x))) + \Omega(h, x),$$
but with a different form of $\Omega$:
$$\Omega(h, x) = \lambda \sum_i \| \nabla_x h_i \|^2$$
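Hedged sketches of the two penalties; the L1 form of the sparsity penalty and the value of $\lambda$ are common choices assumed here, not prescribed by the slides:

```python
import torch

def sparsity_penalty(h, lam=1e-3):
    # Omega(h) = lam * sum_i |h_i|  (one common choice of sparsity penalty)
    return lam * h.abs().sum(dim=-1).mean()

def contractive_penalty(encoder, x, lam=1e-3):
    # Omega(h, x) = lam * sum_i ||grad_x h_i||^2  (squared norm of the encoder Jacobian rows)
    x = x.clone().requires_grad_(True)
    h = encoder(x)
    penalty = 0.0
    for i in range(h.shape[-1]):            # one backward pass per code unit (slow but explicit)
        g, = torch.autograd.grad(h[:, i].sum(), x, create_graph=True)
        penalty = penalty + (g ** 2).sum(dim=-1)
    return lam * penalty.mean()
```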
Variational Auto-Encoders
Reference
• Reference: Auto-Encoding Variational Bayes, Diederik P Kingma,
Max Welling, 2013
β€’ [λ…Όλ¬Έμ˜ κ°€μ •] We will restrict ourselves here to the common case
where we have an i.i.d. dataset with latent variables per datapoint,
and where we like to perform maximum likelihood (ML) or
maximum a posteriori (MAP) inference on the (global) parameters,
and variational inference on the latent variables.
Definition
• Goal of a generative model
- The goal is to find a set $Z = \{z_j\}$ and a function $\theta$ that
generate $X = \{x_i\}$.
i.e. finding $\arg\min_{Z, \theta} d(x, \theta(z))$ where $d$ is a metric
• Since a VAE is also a generative model, the set $Z$ and the
function $\theta$ work together to build a good model. A VAE
assumes that the latent variable $z$ comes from a distribution
parametrized by $\phi$, and learns the parameter $\theta$ that
generates $x$ well.
[Figure 1: graphical model with latent $z$ generating $x$, parameterized by $\theta$]
Problem scenario
• The dataset $X = \{x^{(i)}\}_{i=1}^{N}$ consists of i.i.d. samples of a continuous (or discrete)
variable $x$. Assume $X$ was generated by some random process involving an unobserved
continuous variable $z$. The random process consists of two steps:
(1) $z^{(i)}$ is generated from some prior distribution $p_{\theta^*}(z)$
(2) $x^{(i)}$ is generated from some conditional distribution $p_{\theta^*}(x|z)$
• Assume the prior $p_{\theta^*}(z)$ and the likelihood $p_{\theta^*}(x|z)$ come from parametric
families of distributions $p_\theta(z)$ and $p_\theta(x|z)$ that are differentiable almost
everywhere w.r.t. $\theta$ and $z$.
• Unfortunately we cannot know the true parameter $\theta^*$ or the values of the latent
variables $z^{(i)}$, so we will define a cost function and proceed by deriving a lower
bound for it.
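For intuition, a toy sketch of this two-step generative process; the standard normal prior, the Bernoulli conditional, and the stand-in linear decoder are assumptions for illustration only:

```python
import torch

def sample_x(decoder, latent_dim=2, n=5):
    z = torch.randn(n, latent_dim)        # step (1): z^(i) ~ p_theta*(z), here N(0, I)
    probs = torch.sigmoid(decoder(z))     # decoder outputs Bernoulli parameters
    x = torch.bernoulli(probs)            # step (2): x^(i) ~ p_theta*(x | z)
    return x

decoder = torch.nn.Linear(2, 784)         # stand-in for the true decoder with parameters theta*
samples = sample_x(decoder)
```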
Intractability and Variational Inference
β€’ 𝑝 πœƒ π‘₯ = 𝑝 πœƒ π‘₯ 𝑝 πœƒ π‘₯ 𝑧 𝑑𝑧 is intractable(계산 λΆˆκ°€λŠ₯)
∡ 𝑝 πœƒ 𝑧 π‘₯ = 𝑝 πœƒ π‘₯ 𝑧 𝑝 πœƒ(𝑧)/𝑝 πœƒ(π‘₯) is intractable
β€’ 𝑝 πœƒ 𝑧 π‘₯ λ₯Ό μ•Œ 수 μ—†μœΌλ‹ˆ μš°λ¦¬κ°€ μ•„λŠ” ν•¨μˆ˜λΌκ³  κ°€μ •ν•˜μž.
이런 방법을 variational inference 라고 ν•œλ‹€. 즉, 잘 μ•„λŠ” ν•¨μˆ˜
π‘ž πœ™(𝑧|π‘₯) λ₯Ό 𝑝 πœƒ 𝑧 π‘₯ λŒ€μ‹  μ‚¬μš©ν•˜λŠ” 방법을 variational inference
라고 ν•œλ‹€.
β€’ Idea : prior 𝑝 πœƒ 𝑧 π‘₯ λ₯Ό π‘ž πœ™(𝑧|π‘₯) 둜 μ‚¬μš©ν•΄λ„ λ˜λŠ” μ΄μœ λŠ” 주어진
input 𝑧 κ°€ π‘₯ 에 κ·Όμ‚¬ν•˜κ²Œ ν•™μŠ΅μ΄ λœλ‹€.
The variational bound
• Since we assumed $X = \{x^{(i)}\}_{i=1}^{N}$ is i.i.d., we have
$\log p_\theta(x^{(1)}, \cdots, x^{(N)}) = \sum_{i=1}^{N} \log p_\theta(x^{(i)})$.
• For each $x \in X$ we obtain:
$$\begin{aligned}
\log p_\theta(x) &= \int_z q_\phi(z|x) \log p_\theta(x)\, dz \\
&= \int_z q_\phi(z|x) \log \frac{p_\theta(z, x)}{p_\theta(z|x)}\, dz \\
&= \int_z q_\phi(z|x) \log \left( \frac{p_\theta(z, x)}{q_\phi(z|x)} \cdot \frac{q_\phi(z|x)}{p_\theta(z|x)} \right) dz \\
&= \int_z q_\phi(z|x) \log \frac{p_\theta(z, x)}{q_\phi(z|x)}\, dz + \int_z q_\phi(z|x) \log \frac{q_\phi(z|x)}{p_\theta(z|x)}\, dz \\
&= \mathcal{L}(\theta, \phi; x) + D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z|x)\big) \quad (1) \\
&\geq \mathcal{L}(\theta, \phi; x)
\end{aligned}$$
Since the KL divergence is always $\geq 0$, the term
$\int_z q_\phi(z|x) \log \frac{p_\theta(z, x)}{q_\phi(z|x)}\, dz$ is a lower bound; we denote it by $\mathcal{L}(\theta, \phi; x)$.
Cost function
• (Eq.3) $\mathcal{L}(\theta, \phi; x) = -D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$
<Proof>
$$\begin{aligned}
\mathcal{L}(\theta, \phi; x) &= \int_z q_\phi(z|x) \log \frac{p_\theta(z, x)}{q_\phi(z|x)}\, dz \\
&= \int_z q_\phi(z|x) \log \frac{p_\theta(z)\, p_\theta(x|z)}{q_\phi(z|x)}\, dz \\
&= \int_z q_\phi(z|x) \log \frac{p_\theta(z)}{q_\phi(z|x)}\, dz + \int_z q_\phi(z|x) \log p_\theta(x|z)\, dz \\
&= -D_{KL}\big(q_\phi(z|x) \,\|\, p_\theta(z)\big) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]
\end{aligned}$$
The first term is computable in closed form when $q_\phi$ and $p_\theta(z)$ are normal;
the second term is the reconstruction error.
Cost function
• Lemma. If $p(x) \sim N(\mu_1, \sigma_1^2)$ and $q(x) \sim N(\mu_2, \sigma_2^2)$, then
$$KL\big(p(x) \| q(x)\big) = \ln\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2}$$
• Corollary. If $q_\phi(z|x_i) \sim N(\mu_i, \sigma_i^2 I)$ and $p(z) \sim N(0, I)$, then
$$\begin{aligned}
KL\big(q_\phi(z|x_i) \| p(z)\big) &= \frac{1}{2}\left( \mathrm{tr}(\sigma_i^2 I) + \mu_i^T \mu_i - J + \ln \frac{1}{\prod_{j=1}^{J} \sigma_{i,j}^2} \right) \\
&= \frac{1}{2}\left( \Sigma_{j=1}^{J} \sigma_{i,j}^2 + \Sigma_{j=1}^{J} \mu_{i,j}^2 - J - \Sigma_{j=1}^{J} \ln \sigma_{i,j}^2 \right) \\
&= \frac{1}{2} \Sigma_{j=1}^{J} \left( \mu_{i,j}^2 + \sigma_{i,j}^2 - 1 - \ln \sigma_{i,j}^2 \right)
\end{aligned}$$
where $J$ is the dimensionality of $z$.
• That is, if $q_\phi(z|x_i) \sim N(\mu_i, \sigma_i^2 I)$ and $p(z) \sim N(0, I)$, Eq.3 becomes
$$\mathcal{L}(\theta, \phi; x) = -\frac{1}{2} \Sigma_{j=1}^{J} \left( \mu_{i,j}^2 + \sigma_{i,j}^2 - 1 - \ln \sigma_{i,j}^2 \right) + \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$$
Training the reconstruction error term
• $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ is estimated by Monte Carlo estimation via sampling.
That is, for each $x_i \in X$ we sample $z^{(i,1)}, \cdots, z^{(i,L)}$ and approximate the expectation
by the mean of the log-likelihoods. Usually $L = 1$ is used.
$$\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] \approx \frac{1}{L} \Sigma_{l=1}^{L} \log p_\theta\big(x_i \mid z^{(i,l)}\big)$$
• Naive sampling like this makes backpropagation impossible. The method used to
get around this is the reparametrization trick.
[Diagram: encoder $q_\phi(z|x)$, decoder $p_\theta(x|z)$]
Reparametrization trick
• Reparametrization trick: a method to resolve the problem that sampling
blocks backpropagation.
$$z \sim N(\mu, \sigma^2) \;\Rightarrow\; z = \mu + \sigma \epsilon \text{ where } \epsilon \sim N(0, 1)$$
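A minimal sketch of the trick, assuming the encoder outputs a mean and a log-variance: gradients flow through $\mu$ and $\sigma$ while the randomness is isolated in $\epsilon$.

```python
import torch

def reparametrize(mu, logvar):
    sigma = torch.exp(0.5 * logvar)   # sigma recovered from log(sigma^2)
    eps = torch.randn_like(sigma)     # eps ~ N(0, 1), no gradient flows through this
    return mu + sigma * eps           # z = mu + sigma * eps, differentiable w.r.t. mu and sigma
```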
Log Likelihood
β€’ μ•žμ—μ„œ reconstruction error λ₯Ό 계산할 λ•Œ ν•˜λ‚˜μ”© sampling 을 ν•˜λ©΄
𝐸 π‘ž πœ™ 𝑧 π‘₯ [log 𝑝 πœƒ(π‘₯|𝑧)] ∼
1
𝐿
Σ𝑙=1
𝐿
log 𝑝 πœƒ π‘₯𝑖 𝑧 𝑖,𝑙
= log(𝑝 πœƒ(π‘₯𝑖|𝑧 𝑖
)
κ°€ λœλ‹€. μ—¬κΈ°μ„œ 𝑝 πœƒ π‘₯𝑖 𝑧 𝑖
의 뢄포가 Bernoulli 인 κ²½μš°μ— μ•„λž˜μ˜ 식을
얻을 수 μžˆλ‹€.
좜처 : https://github.com/hwalsuklee/tensorflow-mnist-VAE
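The referenced equation was a figure from the repo above; the standard Bernoulli log-likelihood it corresponds to is $\log p_\theta(x_i \mid z^{(i)}) = \sum_j \big[x_{i,j}\log y_{i,j} + (1 - x_{i,j})\log(1 - y_{i,j})\big]$, where $y_i$ is the decoder output. A hedged sketch:

```python
import torch

def bernoulli_log_likelihood(x, y, eps=1e-7):
    # log p(x | z) for a Bernoulli decoder whose mean is y = decoder(z)
    y = y.clamp(eps, 1 - eps)
    return (x * torch.log(y) + (1 - x) * torch.log(1 - y)).sum(dim=-1)
```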
Log Likelihood
β€’ 𝑝 πœƒ π‘₯𝑖 𝑧 𝑖 의 뢄포가 normal 인 κ²½μš°μ— μ•„λž˜μ˜ 식을 얻을 수 μžˆλ‹€.
좜처 : https://github.com/hwalsuklee/tensorflow-mnist-VAE
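Likewise, this equation was a figure; for a Gaussian decoder with mean $\mu_\theta(z)$ and variance $\sigma_\theta^2(z)$ the standard form is $\log p_\theta(x_i \mid z^{(i)}) = -\sum_j \big[\tfrac{(x_{i,j} - \mu_j)^2}{2\sigma_j^2} + \tfrac{1}{2}\log(2\pi\sigma_j^2)\big]$. A hedged sketch:

```python
import math
import torch

def gaussian_log_likelihood(x, mu, logvar):
    # log p(x | z) for a Gaussian decoder parameterized by mean mu and log-variance logvar
    return -0.5 * ((x - mu).pow(2) / logvar.exp() + logvar + math.log(2 * math.pi)).sum(dim=-1)
```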
Structures
Source: https://github.com/hwalsuklee/tensorflow-mnist-VAE
Structures
Source: https://github.com/hwalsuklee/tensorflow-mnist-VAE
Examples of VAE
Source: https://github.com/hwalsuklee/tensorflow-mnist-VAE
Examples of DAE
Source: https://github.com/hwalsuklee/tensorflow-mnist-VAE
References
• [VLBM] Extracting and composing robust features with denoising
autoencoders, Vincent, Larochelle, Bengio, Manzagol, 2008
• [KW] Auto-Encoding Variational Bayes, Diederik P Kingma, Max
Welling, 2013
• [D] Tutorial on Variational Autoencoders, Carl Doersch, 2016
• PR-010: Auto-Encoding Variational Bayes, ICLR 2014, μ°¨μ€€λ²”
• μ˜€ν† μΈμ½”λ”μ˜ λͺ¨λ“  것 (Everything about Autoencoders), μ΄ν™œμ„,
https://github.com/hwalsuklee/tensorflow-mnist-VAE