I am trying to understand VAEs in depth by implementing one myself, and I am having difficulties back-propagating the losses from the decoder input layer to the encoder output layer.
My encoder network outputs 8 pairs (sigma, mu) which I then combine with the result of a stochastic sampler to produce 8 input values (z) for the decoder network:
decoder_in = sigma * N(0,I) + mu
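This sampling step is the reparameterization trick. A minimal NumPy sketch (the function name and the convention of returning `eps` for the backward pass are my own choices, not from any particular library):

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I).

    Returns z together with eps; eps must be stored, because the
    backward pass through this node needs it.
    """
    eps = rng.standard_normal(mu.shape)
    z = mu + sigma * eps
    return z, eps
```

Because the randomness is isolated in `eps`, the mapping from (mu, sigma) to z is deterministic and differentiable.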
Then I run forward propagation through the decoder network, compute the MSE reconstruction loss, and back-propagate it through the decoder weights up to the decoder input layer.
Here I am completely stuck, since I cannot find a comprehensible explanation of how to back-propagate the loss from the decoder input layer to the encoder output layer.
My best idea was to store the result of sampling from N(0,I) as epsilon and use it to compute the gradients at the encoder output:
dL/dsigma = epsilon * dL/dz
dL/dmu = 1.0 * dL/dz
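Those two formulas can be checked numerically. A small sketch (function and variable names are illustrative, not from any framework); since z = mu + sigma * eps, we have dz/dmu = 1 and dz/dsigma = eps:

```python
import numpy as np

def backprop_reparam(dL_dz, eps):
    """Backward pass through z = mu + sigma * eps.

    By the chain rule:
      dL/dmu    = dL/dz * dz/dmu    = dL/dz
      dL/dsigma = dL/dz * dz/dsigma = dL/dz * eps
    """
    dL_dmu = dL_dz
    dL_dsigma = dL_dz * eps
    return dL_dmu, dL_dsigma
```

A finite-difference check against a toy loss such as L = 0.5 * z**2 confirms these gradients match the numeric derivatives.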
It kind of works, but in the long run the sigma components of the encoded distribution vectors tend to regress to zero, so my VAE effectively degenerates into a plain AE.
Also, I still have no clue how to integrate the KL loss into this scheme. Should I add it to the encoder loss, or somehow combine it with the decoder's MSE loss?
The VAE does not use the reconstruction error alone as the cost objective; if you do that, the model just turns back into an autoencoder. The VAE uses the variational lower bound and a couple of neat tricks to make it easy to compute.
Referring to the original "Auto-Encoding Variational Bayes" paper:
The variational lower bound objective is (eq 10):
L = 1/2 * sum_{j=1..d} (1 + log(sigma_j^2) - mu_j^2 - sigma_j^2) + log p(x|z)
Where d is the number of latent variables, mu and sigma are the outputs of the encoder network used to scale the standard normal samples, and z is the encoded sample. p(x|z) is just the decoder's probability of generating back the input x.
All the terms in the above equation are differentiable, so the objective can be optimized with gradient descent or any other gradient-based optimizer you find in TensorFlow.
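To connect this back to the question: you minimize the negative of that bound, so the KL term contributes its own analytic gradients at the encoder's (mu, sigma) outputs, added to whatever gradient arrives from the decoder. A hedged NumPy sketch (function names are mine; MSE stands in for -log p(x|z), which corresponds to a Gaussian decoder assumption):

```python
import numpy as np

def neg_elbo(x, x_recon, mu, sigma):
    """Negative of the lower bound from eq. 10: reconstruction + KL term.

    The KL term is what keeps sigma from collapsing to zero.
    """
    recon = 0.5 * np.sum((x - x_recon) ** 2)  # stand-in for -log p(x|z)
    kl = -0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2)
    return recon + kl

def kl_grads(mu, sigma):
    """Analytic gradients of the KL term w.r.t. mu and sigma.

    Add these to the gradients back-propagated from the decoder at the
    encoder output layer:
      dKL/dmu    = mu
      dKL/dsigma = sigma - 1/sigma
    """
    dKL_dmu = mu
    dKL_dsigma = sigma - 1.0 / sigma
    return dKL_dmu, dKL_dsigma
```

Note that dKL/dsigma is negative for sigma < 1, so the KL term actively pushes sigma back toward 1 instead of letting it collapse to zero, which is exactly the degeneration described in the question.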