Last update: 16 February 2020

Generative models are a class of statistical models that can generate new data points. They have a variety of applications, and they are really fun to play with.

In this article, we're gonna explore one type of generative model called the **variational autoencoder (VAE)**.

Before diving into VAEs, let's first understand what autoencoders are.

An autoencoder is an unsupervised learning approach that aims to learn a lower-dimensional feature representation of the data. This is achieved by training a neural network to reconstruct its own input while placing some constraints on the architecture.

Given enough capacity, an autoencoder can simply learn the identity function and fail to capture the most important features of the data. One way to prevent this is to introduce **regularization**.

Let's explore some commonly used techniques.

Sparsity in the hidden activations can be achieved with different methods. The most commonly used ones are:

**Using KL divergence as a regularization term**

KL divergence measures how different two probability distributions are from each other. We're gonna use this property to add regularization to our network, and this is how we do it:

We define a sparsity parameter \(\pmb{\rho}\) that typically takes a small value, for example \(\pmb{\rho} = 0.01 \).

Then we calculate the average activation of the hidden unit \( j \) over the \( m \) training examples, which we call \( \pmb{\widehat{\rho}_{j}} \):

$$ \hat{\rho_j} = \frac{1}{m} \sum_{i=1}^m h_j(x^{(i)}) $$

The goal is to enforce:

$$ \hat{\rho_j} = \rho $$

To achieve this, we're gonna add an extra term to the total loss that penalizes values of \( \widehat{\rho}_j \) that deviate significantly from \( \rho \):

$$ L = \mathcal{L}(x, {x'} ) + \lambda\sum_{j}KL(\rho \| \hat{\rho_j}) $$
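To make the penalty term concrete, here is a minimal sketch in plain Python. It assumes (this is not spelled out above) that \( \rho \) and \( \widehat{\rho}_j \) are treated as the means of two Bernoulli distributions, which is a common choice when the hidden activations lie in (0, 1). Under that assumption:

$$ KL(\rho \| \hat{\rho}_j) = \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} $$

The function names `kl_bernoulli` and `sparsity_penalty` are made up for this illustration.

```python
import math

def kl_bernoulli(rho, rho_hat, eps=1e-12):
    """KL(rho || rho_hat) between two Bernoulli distributions (assumed form)."""
    # Clip rho_hat away from 0 and 1 so the logarithms stay finite.
    rho_hat = min(max(rho_hat, eps), 1.0 - eps)
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(activations, rho=0.01):
    """Sum over hidden units j of KL(rho || rho_hat_j).

    activations: list of m rows (one per training example); row i holds
    the hidden activations h_j(x^(i)), each assumed to be in (0, 1).
    """
    m = len(activations)
    n_hidden = len(activations[0])
    # rho_hat_j is the average activation of hidden unit j over the m examples.
    rho_hat = [sum(row[j] for row in activations) / m for j in range(n_hidden)]
    return sum(kl_bernoulli(rho, r) for r in rho_hat)

# Toy activations for 2 examples and 3 hidden units (made-up numbers).
h = [[0.02, 0.50, 0.01],
     [0.00, 0.70, 0.01]]
print(sparsity_penalty(h, rho=0.01))
```

The penalty is zero when every \( \widehat{\rho}_j \) equals \( \rho \) and grows as the average activations drift away from it. In the total loss it would be multiplied by \( \lambda \) and added to the reconstruction loss.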

**Applying L1 or L2 regularization**

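L2 regularization penalizes the squared hidden activations:

$$ L = \mathcal{L}(x, {x'} ) +\lambda\sum_{j} h_j ^2 $$

L1 regularization penalizes their absolute values:

$$ L = \mathcal{L}(x, {x'} ) +\lambda\sum_{j} |h_j| $$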
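The two penalties are simple enough to write down directly. The following is a small sketch in plain Python. The helper names and the `kind` argument are invented for this example, and `recon_loss` stands in for an already computed \( \mathcal{L}(x, x') \).

```python
def l2_penalty(h):
    """Sum of squared hidden activations: sum_j h_j^2."""
    return sum(v * v for v in h)

def l1_penalty(h):
    """Sum of absolute hidden activations: sum_j |h_j|."""
    return sum(abs(v) for v in h)

def regularized_loss(recon_loss, h, lam, kind="l1"):
    """Reconstruction loss plus lambda times the chosen activation penalty."""
    penalty = l1_penalty(h) if kind == "l1" else l2_penalty(h)
    return recon_loss + lam * penalty

# Toy hidden activations (made-up numbers).
h = [1.0, -2.0, 0.5]
print(regularized_loss(1.0, h, lam=0.1, kind="l1"))
print(regularized_loss(1.0, h, lam=0.1, kind="l2"))
```

Note that both penalties act on the hidden activations \( h_j \), not on the weights.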

**Using the ReLU activation function for the hidden layer**

**Using dropout**

With a denoising autoencoder, rather than adding a regularization term, the network is trained to recover the original undistorted input from a partially corrupted input. This forces the network to learn useful features.
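Here is a rough sketch of that idea in plain Python. It assumes masking noise (randomly zeroing some input values) as the corruption, which is only one possible choice. `reconstruct` is a placeholder for whatever autoencoder you train, and all the names are invented for this example. The key point is that the loss compares the reconstruction with the clean input, not with the corrupted one.

```python
import random

def mask_corrupt(x, drop_prob=0.3, rng=None):
    """Set each input value to 0 with probability drop_prob (masking noise)."""
    if rng is None:
        rng = random.Random()
    return [0.0 if rng.random() < drop_prob else v for v in x]

def squared_error(x, x_rec):
    """L2 reconstruction loss between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(x, x_rec))

def denoising_loss(x, reconstruct, drop_prob=0.3, rng=None):
    """Corrupt x, reconstruct from the corrupted copy, compare with the clean x."""
    x_tilde = mask_corrupt(x, drop_prob, rng)
    x_rec = reconstruct(x_tilde)
    return squared_error(x, x_rec)

# Toy usage: an identity "network" cannot undo the corruption,
# so its loss is positive whenever a nonzero value gets masked.
x = [1.0, 2.0, 3.0]
print(denoising_loss(x, lambda v: v, drop_prob=0.5, rng=random.Random(0)))
```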

A contractive autoencoder adds the squared Frobenius norm of the Jacobian matrix of the encoder with respect to the input as a penalty term:

$$ L = \mathcal{L}(x, {x'} ) +\lambda\sum_{j} || \nabla_xh_j ||^2 $$

Here \( \mathcal{L}(x, x') \) is the reconstruction loss, which is often the L2 loss.
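For the Jacobian penalty, here is a small sketch under one specific assumption: the encoder is a single sigmoid layer \( h = \sigma(Wx + b) \). In that case \( \partial h_j / \partial x_i = h_j(1 - h_j)W_{ji} \), so the penalty has the closed form \( \sum_j (h_j(1-h_j))^2 \sum_i W_{ji}^2 \). The function names are invented for the example, and other encoder architectures would need a different computation.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def encode(W, b, x):
    """Single sigmoid layer h = sigmoid(W x + b); W is a list of rows."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj)
            for row, bj in zip(W, b)]

def contractive_penalty(W, b, x):
    """Squared Frobenius norm of dh/dx for the sigmoid encoder above."""
    h = encode(W, b, x)
    # Row j of the Jacobian is h_j * (1 - h_j) * W[j], so its squared norm
    # is (h_j * (1 - h_j))^2 times the squared norm of W[j].
    return sum((hj * (1.0 - hj)) ** 2 * sum(w * w for w in row)
               for hj, row in zip(h, W))

# Toy weights and input (made-up numbers).
W = [[0.5, -1.0], [2.0, 0.3]]
b = [0.1, -0.2]
x = [0.4, 0.7]
print(contractive_penalty(W, b, x))
```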

Autoencoders have a variety of applications. Among them:

- **Dimensionality reduction**: Since autoencoders are good at learning the useful and important features of the data, they are a good means of reducing the dimensionality of the input data.
- **Feature extraction**: Autoencoders can be used as feature extractors for supervised learning models. For example, by removing the decoder and plugging a classifier on top of the encoder, we can build a good classifier neural network.

VAEs are based on two important assumptions:

- There is a prior distribution \(p_\theta(z)\) that generates the latent state \( z \), where \(\theta\) are the parameters of the distribution.
- There is a conditional distribution \( p_\theta(x|z) \) that generates the data \( x \) given \( z \).

The goal now is to estimate the parameters \(\theta\) without having access to the latent state \( z \).

For simplicity we're gonna assume that:

- The prior distribution \( p_\theta(z) \) is a unit Gaussian
- The conditional distribution \( p_\theta(x|z) \) is a diagonal Gaussian. Its mean and variance are estimated with a neural network.

To estimate the latent state \( z \) from the input data \( x \), we could use Bayes' rule:

$$ p_\theta(z|x) = \frac{p_\theta(x|z)p_\theta(z)}{p_\theta(x)} $$

The problem here is that \( p_\theta(x) = \int p_\theta(x|z)\,p_\theta(z)\,dz \) is an intractable integral. To get around this, we're gonna use a neural network to estimate a distribution over latent states, which we call \( q_\phi(z|x) \), and then sample from it to get the latent state \( z \).
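As an illustration of the sampling step, here is a minimal sketch in plain Python. It assumes that \( q_\phi(z|x) \) is a diagonal Gaussian and that the encoder network outputs a mean vector `mu` and a log-variance vector `log_var` for each input. Both are common conventions but are assumptions here, and `sample_latent` is a made-up name.

```python
import math
import random

def sample_latent(mu, log_var, rng=None):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps, eps ~ N(0, 1)."""
    if rng is None:
        rng = random.Random()
    # exp(0.5 * log_var) converts a log-variance into a standard deviation.
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

# Hypothetical encoder outputs for one input x, with a 2-dimensional latent space.
mu = [0.3, -1.2]
log_var = [-0.5, 0.1]
print(sample_latent(mu, log_var, rng=random.Random(0)))
```

Predicting the log-variance rather than the variance itself keeps the variance positive without any extra constraint on the network output.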

After training, we throw away the encoder part and generate new samples by sampling a latent state \( z \) from the prior and passing it through the decoder.
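Generation can be sketched in a few lines of plain Python. It assumes the unit Gaussian prior from above, and `decoder` is a placeholder for a trained network that maps a latent vector to a data point. The toy decoder below is made up just to make the snippet runnable.

```python
import random

def generate(decoder, latent_dim, n_samples, rng=None):
    """Sample z ~ N(0, I) and map each z through the decoder."""
    if rng is None:
        rng = random.Random()
    samples = []
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        samples.append(decoder(z))
    return samples

# Toy stand-in for a trained decoder: it just scales the latent vector.
def toy_decoder(z):
    return [2.0 * v for v in z]

print(generate(toy_decoder, latent_dim=2, n_samples=3, rng=random.Random(0)))
```

Note that the encoder never appears here: once training is done, only the prior and the decoder are needed.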

In the next part, we'll see how to train VAEs to generate new data. Stay tuned!