machine-learning deep-learning convolution autoencoder unsupervised-learning

Why flatten last encoder layer in a convolutional VAE?

I am quite new in the deep learning game, I was wondering why do we flatten the last layer of the encoder in a VAE and then give the flattened output to a linear layer, which then approximates a location and scale parameter for the prior? Can't we just split the output of a convolutional layer and get the location and scale from here directly, or do the spatial information captured by a convolution mess up the scale and location?

Thanks a lot!

Solution

Why do we flatten the last layer of the encoder in a VAE?

There isn't really a good reason other than to make it convenient for printing or reporting. If right before flattening the encoder is of shape [BatchSize,2,2,32] , flattening it to [BatchSize,128] just makes it handy to just list all 128 encoded values per sample. When the decoder then reshapes it to [BatchSize,2,2,32] all the spacial information is put back where it was. No spacial information was lost.

Of course, one may decide to use the encoder of a trained VAE as an image feature extractor. This is actually very useful when we have a LOT of unlabeled images to train a VAE with, but only a few labeled images. After training the VAE on the large unlabeled image set, the encoder effectively becomes a feature extractor. We can then feed the feature extractor into a dense layer whos purpose is to learn the labels. Having the encoder output a flattened data set is very useful in this situation.