Tags: python, neural-network, autoencoder

Is this code a proper way of understanding a VAE vs. a standard autoencoder?


I have created two mini encoding networks, one for a standard autoencoder and one for a VAE, and plotted the latent codes of each. I would just like to know if my understanding is correct for this mini case. Note that it is only one epoch and it ends with the encoding step.

import numpy as np
from matplotlib import pyplot as plt

np.random.seed(0)

fig, (ax,ax2) = plt.subplots(2,1)

def relu(x):
    c = np.where(x>0,x,0)
    return c


#Standard autoencoder
x = np.random.randint(0,2,[100,5])
w_autoencoder = np.random.normal(0,1,[5,2])


bottle_neck = relu(x.dot(w_autoencoder))

ax.scatter(bottle_neck[:,0],bottle_neck[:,1])

#VAE autoencoder

w_vae1 = np.random.normal(0,1,[5,2])
w_vae2 = np.random.normal(0,1,[5,2])

mu = relu(x.dot(w_vae1))
sigma = relu(x.dot(w_vae2))

epsilon_sample = np.random.normal(0,1,[100,2])

latent_space = mu+np.log2(sigma)*epsilon_sample

ax2.scatter(latent_space[:,0], latent_space[:,1],c='red')



Solution

  • Since your motive is "understanding", I would say you are headed in the right direction, and working on this sort of implementation definitely helps you understand. But I strongly believe "understanding" has to be achieved first from books/papers and only then via the implementation/code.

    At a quick glance, your standard autoencoder looks fine. Through relu(x), your implementation assumes that the latent code lies in the range [0, infinity).

    However, when implementing a VAE, you cannot produce the latent code with the relu(x) function; this is where your theoretical understanding is missing. In a standard VAE we assume that the latent code is a sample from a Gaussian distribution, and so we approximate the parameters of that Gaussian, i.e. its mean and covariance. We further assume that this Gaussian is factorized, which means the covariance matrix is diagonal. In your implementation, you are approximating the mean and diagonal covariance as:

    mu = relu(x.dot(w_vae1))
    sigma = relu(x.dot(w_vae2))
    

    which seems fine, but when drawing the sample (the reparameterization trick), it is not clear why you introduced np.log2(). Since you are using the ReLU() activation, you may end up with 0 in your sigma variable, and np.log2(0) gives -inf. I believe you were motivated by some available code where they do:

    mu = relu(x.dot(w_vae1)) #same as yours
    logOfSigma = x.dot(w_vae2) #you are forcing your network to learn log(sigma)
    

    Now, since you are approximating the log of sigma, your output is allowed to be negative, because to recover sigma you would do something like np.exp(logOfSigma), which ensures you always get positive values on the diagonal of the covariance matrix. To draw a sample, you can then simply do:

    latent_code = mu + np.exp(logOfSigma)*epsilon_sample
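
    Putting it together, here is a minimal runnable sketch of how your toy VAE sampling step could look with the log-sigma parameterization. It reuses your variable names and your random toy data; only log_sigma (learned without ReLU) and the np.exp() in the sampling line differ from your script, and the weights are still random, untrained values:

    import numpy as np
    from matplotlib import pyplot as plt

    np.random.seed(0)

    def relu(x):
        return np.where(x > 0, x, 0)

    # same toy binary data as in your script: 100 samples, 5 features
    x = np.random.randint(0, 2, [100, 5])

    # one linear "encoder" for the mean, one for log(sigma)
    w_vae1 = np.random.normal(0, 1, [5, 2])
    w_vae2 = np.random.normal(0, 1, [5, 2])

    mu = relu(x.dot(w_vae1))      # same as yours
    log_sigma = x.dot(w_vae2)     # no ReLU: log(sigma) is allowed to be negative

    # reparameterization trick: z = mu + sigma * epsilon, with epsilon ~ N(0, I)
    epsilon_sample = np.random.normal(0, 1, [100, 2])
    latent_code = mu + np.exp(log_sigma) * epsilon_sample

    plt.scatter(latent_code[:, 0], latent_code[:, 1], c='red')
    plt.show()

    As in your script, this is only the encoder's forward pass with randomly initialized weights, so the scatter plot mainly illustrates the sampling mechanics rather than a trained latent space.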
    

    Hope this helps!