
Explaining Variational Autoencoder Gaussian parameterization


In the original Auto-Encoding Variational Bayes paper, the authors describe the "reparameterization trick" in section 2.4. The trick is to break your latent state z up into a learnable mean and sigma (learned by the encoder) and add Gaussian noise. You then sample a datapoint from z (basically you generate an encoded image) and let the decoder map the encoded datapoint back to the original image.
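Taken on its own, the trick is just a change of variables: instead of sampling z directly from N(mu, sigma^2), you sample eps from N(0, 1) and compute z = mu + sigma * eps, so gradients can flow through mu and sigma. A minimal NumPy sketch of just that sampling step (the numbers are arbitrary placeholders, not taken from the paper or the code below):

import numpy as np

mu = np.array([0.5, -1.0])           # mean output by the encoder (placeholder values)
log_sigma_sq = np.array([0.1, 0.3])  # log-variance output by the encoder (placeholder values)

eps = np.random.standard_normal(mu.shape)  # noise drawn from N(0, 1)
sigma = np.sqrt(np.exp(log_sigma_sq))      # convert log-variance to standard deviation
z = mu + sigma * eps                       # reparameterized sample; differentiable w.r.t. mu and sigma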

I have a hard time getting over how strange this is. Could someone explain a bit more about the latent variable model, specifically:

  1. Why are we assuming the latent state is Gaussian?
  2. How is it possible that a Gaussian can generate an image?
  3. And how does backprop force the encoder to learn a Gaussian function as opposed to an unknown non-linear function?

Here is an example implementation of the latent model from here in TensorFlow.

...neural net code maps input to hidden layers z_mean and z_log_sigma_sq

self.z_mean, self.z_log_sigma_sq = \
    self._recognition_network(network_weights["weights_recog"],
                              network_weights["biases_recog"])

# Draw one sample z from Gaussian distribution
n_z = self.network_architecture["n_z"]
eps = tf.random_normal((self.batch_size, n_z), 0, 1,
                       dtype=tf.float32)
# z = mu + sigma*epsilon
self.z = tf.add(self.z_mean, 
                tf.mul(tf.sqrt(tf.exp(self.z_log_sigma_sq)), eps))

...neural net code maps z to output

Solution

    1. They are not assuming that the activations of the encoder follow a Gaussian distribution; they are enforcing that, out of the possible solutions, a Gaussian-resembling one is chosen.

    2. The image is generated by decoding an activation/feature; the activations are distributed so that they resemble a Gaussian.

    3. They minimize the KL divergence between the distribution of the activations and a Gaussian one (see the sketch below).
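For reference, with a diagonal Gaussian encoder this KL term has a closed form, KL = -0.5 * sum(1 + log(sigma^2) - mu^2 - sigma^2). In the TF1-style API used by the snippet above, it would look roughly like this (a sketch only; z_mean and z_log_sigma_sq are taken from the code above, the rest is assumed):

# Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
# summed over the n_z latent dimensions
latent_loss = -0.5 * tf.reduce_sum(
    1 + self.z_log_sigma_sq
      - tf.square(self.z_mean)
      - tf.exp(self.z_log_sigma_sq), 1)

# Added to the reconstruction loss, this term is what pushes the
# encoder's output distribution toward a standard Gaussian during backprop.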