4. Multimodal autoencoder [Ngiam+ 11]
¤ Learns a shared representation over two modalities with an autoencoder (AE)
¤ Multimodal DBM [Srivastava+ 12]
¤ Uses a deep Boltzmann machine instead of an AE to model the joint distribution over the modalities
[Figure: panels (a) Audio RBM, (b) Video RBM, (c) Shallow Bimodal RBM, (d) Bimodal DBN; hidden units over audio/video inputs, with a shared representation in (c) and (d)]
Figure 2: RBM Pretraining Models. We train RBMs for (a) audio and (b) video separately as a baseline. The shallow model (c) is limited and we find that this model is unable to capture correlations across the modalities. The bimodal deep belief network (DBN) model (d) is trained in a greedy layer-wise fashion by first training models (a) & (b). We later “unroll” the deep model (d) to train the deep autoencoder models presented in Figure 3.
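To make the pretraining in panels (a) and (b) concrete, here is a minimal NumPy sketch of CD-1 training for a single Bernoulli RBM. It is not the authors' implementation (in particular it omits the sparsity penalty the captions mention), and the names train_rbm, audio_features, video_features and n_hidden=256 are illustrative assumptions.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.01, epochs=5, seed=0):
    # CD-1 training of a Bernoulli RBM; data is (n_samples, n_visible) with values in [0, 1].
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_v = np.zeros(n_visible)  # visible biases
    b_h = np.zeros(n_hidden)   # hidden biases
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + b_h)                          # p(h|v) for the data
            h0 = (rng.random(n_hidden) < ph0).astype(float)      # sampled hidden states
            pv1 = sigmoid(h0 @ W.T + b_v)                        # one Gibbs step down...
            ph1 = sigmoid(pv1 @ W + b_h)                         # ...and back up
            W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))   # CD-1 update
            b_v += lr * (v0 - pv1)
            b_h += lr * (ph0 - ph1)
    return W, b_v, b_h

# Train one RBM per modality as in panels (a) and (b) (hypothetical feature arrays):
# audio_W, audio_bv, audio_bh = train_rbm(audio_features, n_hidden=256)
# video_W, video_bv, video_bh = train_rbm(video_features, n_hidden=256)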
[Figure: (a) Video-Only Deep Autoencoder (video input only), (b) Bimodal Deep Autoencoder (audio and video inputs); both map to a shared representation and produce audio and video reconstructions]
Figure 3: Deep Autoencoder Models. A “video-only” model is shown in (a) where the model
learns to reconstruct both modalities given only video as the input. A similar model can be
drawn for the “audio-only” setting. We train the (b) bimodal deep autoencoder in a denoising
fashion, using an augmented dataset with examples that require the network to reconstruct both
modalities given only one. Both models are pre-trained using sparse RBMs (Figure 2d). Since
we use a sigmoid transfer function in the deep network, we can initialize the network using the
conditional probability distributions p(h|v) and p(v|h) of the learned RBM.
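The last two sentences of the caption suggest a simple recipe: unroll each learned RBM into a sigmoid encoder/decoder pair, and build an augmented training set in which one modality is zeroed out while the target is always the full pair. The sketch below is only an illustration of that idea, not the authors' code; it reuses the weight convention of the earlier train_rbm snippet.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unroll_rbm(W, b_v, b_h):
    # Initialize an encoder/decoder pair from one RBM: the encoder mirrors
    # p(h|v) = sigmoid(v W + b_h) and the decoder mirrors p(v|h) = sigmoid(h W^T + b_v).
    encode = lambda v: sigmoid(v @ W + b_h)
    decode = lambda h: sigmoid(h @ W.T + b_v)
    return encode, decode

def augment_bimodal(audio, video):
    # Denoising-style augmentation: besides the clean pair, add examples where one
    # modality is zeroed out; the reconstruction target is always the full (audio, video) pair.
    inputs, targets = [], []
    for a, v in zip(audio, video):
        for a_in, v_in in [(a, v), (a, np.zeros_like(v)), (np.zeros_like(a), v)]:
            inputs.append(np.concatenate([a_in, v_in]))
            targets.append(np.concatenate([a, v]))
    return np.array(inputs), np.array(targets)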
Therefore, we consider greedily training a RBM over the pre-trained layers for each modality, as motivated by deep learning methods (Figure 2d). In particular, the posteriors (Equation 2) of the first layer hidden variables are used as the training data for the new layer. By representing the data through learned first layer representations, it can be easier for the model to learn higher-order correlations across modalities. Informally, the first layer representations correspond to phonemes and visemes and the second layer models the relationships between them. Figure 4 shows visualizations of the learned features.

However, this multimodal RBM still has two issues. First, there is no explicit objective for the model to discover correlations across the modalities; it is possible for the model to find representations such that some hidden units are tuned only for audio while others are tuned only for video. Second, the models are clumsy to use in a cross modality learning setting where only one modality is present during supervised training and testing. With only a single modality present, one would need to integrate out the unobserved visible variables to perform inference.
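As a rough illustration of the greedy step just described (first-layer posteriors becoming the training data for the shared layer), here is a short sketch that reuses the hypothetical sigmoid/train_rbm helpers and the per-modality RBM parameters from the earlier snippets; all names are illustrative.

# First-layer posteriors p(h|v) for each modality.
audio_h = sigmoid(audio_features @ audio_W + audio_bh)
video_h = sigmoid(video_features @ video_W + video_bh)

# Concatenate the two first-layer representations and train the shared
# second-layer RBM over them (the bimodal DBN of Figure 2d).
joint_data = np.concatenate([audio_h, video_h], axis=1)
shared_W, shared_bv, shared_bh = train_rbm(joint_data, n_hidden=512)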
Thus, we propose a deep autoencoder that resolves both issues. We first consider the cross modality learning setting where both modalities are present during feature learning, but only a single modality is used during supervised training and testing.
[Figure: Image-specific DBM, Text-specific DBM, Multimodal DBM]
Figure 2: Left: Image-specific two-layer DBM that uses a Gaussian model to model the distribution over real-valued image features. Middle: Text-specific two-layer DBM that uses a Replicated Softmax model to model its distribution over the word count vectors. Right: A Multimodal DBM that models the joint distribution over image and text inputs.
We illustrate the construction of a multimodal DBM using an image-text bi-modal DBM as our running example. Let $\mathbf{v}_m \in \mathbb{R}^D$ denote an image input and $\mathbf{v}_t \in \mathbb{N}^K$ denote a text input. Consider modeling each data modality using separate two-layer DBMs (Fig. 2). The image-specific two-layer DBM assigns probability to $\mathbf{v}_m$ that is given by (ignoring bias terms on the hidden units for clarity):

$$P(\mathbf{v}_m; \theta) = \sum_{\mathbf{h}^{(1)}, \mathbf{h}^{(2)}} P(\mathbf{v}_m, \mathbf{h}^{(1)}, \mathbf{h}^{(2)}; \theta) = \cdots \qquad (4)$$
→ deep Boltzmann machine
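The right-hand side of Eq. (4) is cut off in the excerpt. Purely to make the marginalization concrete, the sketch below brute-forces the sum over all binary hidden states of a tiny two-layer model, assuming the standard Gaussian-Bernoulli DBM energy (and, as in the excerpt, ignoring hidden biases); it omits the partition function Z(θ) and is only feasible for toy layer sizes.

from itertools import product
import numpy as np

def unnormalized_p_vm(v, W1, W2, b, sigma):
    # Sum over h1, h2 of exp(-E(v, h1, h2)) for a two-layer Gaussian-Bernoulli DBM with
    # energy E = sum_i (v_i - b_i)^2 / (2 sigma_i^2) - (v / sigma) W1 h1 - h1 W2 h2
    # (hidden bias terms ignored, as in the excerpt).
    n1, n2 = W1.shape[1], W2.shape[1]
    total = 0.0
    for h1_bits in product([0.0, 1.0], repeat=n1):
        h1 = np.array(h1_bits)
        for h2_bits in product([0.0, 1.0], repeat=n2):
            h2 = np.array(h2_bits)
            energy = (np.sum((v - b) ** 2 / (2 * sigma ** 2))
                      - (v / sigma) @ W1 @ h1
                      - h1 @ W2 @ h2)
            total += np.exp(-energy)
    return total  # divide by Z(theta) to obtain P(v_m; theta) as in Eq. (4)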
8. Conditional VAE (CVAE) [Kingma+ 2014][Sohn+ 2015]
¤ Models the conditional distribution p(x|y) of one modality x given another modality y
¤ Conditional multimodal autoencoder (CMMA) [Pandey 16]
¤ Also models p(x|y) and is trained with variational methods, like a CVAE
¤ Difference from the CVAE: the latent representation z is drawn from p(z|y), i.e. it depends directly on the conditioning modality y
¤ Used to generate and modify one modality x (faces) from the other modality y (attributes)
Since the conditional log-likelihood of the proposed model is intractable, we use variational methods for training the model, whereby the posterior of the latent variables given the faces and the attributes is approximated by a tractable distribution. While variational methods have long been a popular tool for training graphical models, their usage for deep learning became popular after the reparametrization trick of [15, 21, 24]. Prior to that, mean-field approximation had been used for training deep Boltzmann machines (DBM) [22]. However, the training of a DBM involves solving a variational approximation problem for every instance in the training data. On the other hand, reparametrizing the posterior allows one to solve a single parametrized variational approximation problem for all instances in the training data simultaneously.
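To spell out what a "single parametrized variational approximation" means in code, here is a minimal sketch of amortized inference with the reparametrization trick: one shared encoder (the hypothetical encoder_network) produces the variational parameters for every training instance, and the sample is a differentiable function of those parameters. This illustrates the general trick, not the CMMA implementation.

import numpy as np

def reparametrized_sample(mu, log_var, rng=np.random.default_rng(0)):
    # z ~ N(mu, diag(exp(log_var))) written as z = mu + sigma * eps with eps ~ N(0, I),
    # so gradients can flow through mu and log_var.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Amortization: instead of solving a separate mean-field problem per instance (as in
# DBM training), a single shared network maps each (x, y) to its posterior parameters.
# mu, log_var = encoder_network(x, y)        # hypothetical shared encoder
# z = reparametrized_sample(mu, log_var)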
The proposed model is referred to as conditional multimodal autoencoder (CMMA). We use CMMA for the task of generating faces from attributes, and to modify faces in the training data by modifying the corresponding attributes. The dataset used is the cropped Labelled Faces in the Wild (LFW) dataset [11]. We also compare the qualitative and quantitative performance of CMMA against …

Figure 1: A graphical representation of CMMA

Here, the faces correspond to the modality that we wish to generate and the attributes correspond to the modality that we wish to condition on.
A formal description of the problem is as follows. We are given an i.i.d. sequence of N datapoints $\{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$. For a fixed datapoint $(x, y)$, let x be the modality that we wish to generate and y be the modality that we wish to condition on. We assume that x is generated by first sampling a real-valued latent representation z from the distribution p(z|y), and then sampling x from the distribution p(x|z). The graphical representation of the model is given in Figure 1. Furthermore, we assume that the conditional distribution of the latent representation z given y and the distribution of x given z are …
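A minimal sketch of the generative process just described: sample z from p(z|y), then x from p(x|z). It assumes, as is typical for such models, that both conditionals are diagonal Gaussians whose parameters come from hypothetical networks prior_net and decoder_net; this is an illustration, not the paper's implementation.

import numpy as np

def sample_cmma(y, prior_net, decoder_net, rng=np.random.default_rng(0)):
    # z ~ p(z|y): conditional prior over the latent representation.
    mu_z, log_var_z = prior_net(y)
    z = mu_z + np.exp(0.5 * log_var_z) * rng.standard_normal(mu_z.shape)
    # x ~ p(x|z): decode the latent representation into the target modality.
    mu_x, log_var_x = decoder_net(z)
    x = mu_x + np.exp(0.5 * log_var_x) * rng.standard_normal(mu_x.shape)
    return x, z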