Autoencoders are generally used for dimensionality reduction. They compress the input by throwing away unnecessary dimensions. They work by compressing the input into a latent-space representation and then reconstructing the output from this representation. So, how do autoencoders know which features are the most important to retain and which are unimportant enough to throw away? A related question: how do autoencoders extract features from images? In a CNN, the convolutional layers are responsible for extracting image features; in an autoencoder, how, and in which layer, are the features extracted?
An autoencoder is a particular network that tries to solve the identity problem, expressed as x' = g(h(x)), where h is the encoder block and g is the decoder block.
The latent space z is the minimal expression for a given input x, and it resides in the middle of the network. It's worth clarifying that this space holds different shapes, and each one corresponds to a certain instance seen during the training phase. Using the CNN that you referred to as support: it's like a feature map, but instead of a bunch of feature maps across the network there's only one, and again, it holds different representations based on what the network observed during training.
So, the question is: how does it manage to compress and decompress? Well, the data used for training has a domain, and every instance has similarities (all cats share the same abstract qualities, the same goes for mountains; they all have something in common). Therefore the network learns how to fit what describes the data into smaller pieces combined, and, from those smaller pieces (with values ranging from 0 to 1), how to build bigger pieces back.
Taking the same example of cats: all of them have two ears, fur, two eyes, and so on. I didn't mention the details, but you can think of the shape of those ears, what the fur looks like, how big the eyes are, the colors and the brightness. Think of that listing as the latent space z and the details as the x' output.
For more details see this exhaustive explanation with the different AE variants: https://wikidocs.net/3413.
Hope this helps.
EDIT 1:
How, and in which layer, are the features of images extracted?
Its design:
An autoencoder is a network whose design makes it possible to compress and decompress the training data; it is not an arbitrary network at all.
First, it has the shape of a sand-clock: in the encoder block each layer has fewer neurons than the previous one, and right after the "latent space layer(s)" it does the opposite, increasing the number of neurons in the decoder block until it reaches the size of the input layer (the reconstruction, and therefore the output).
Next, each layer is a Dense layer, meaning that every neuron of one layer is fully connected to the next, so all the features are carried from layer to layer. The activation function of each neuron is (ideally) tanh, so every possible output lies in [-1, 1]. Finally, the loss function tends to be the Root Mean Squared Error, which tells how far the reconstruction is from the original.
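As a minimal sketch of that design (assuming Keras; the layer sizes are hypothetical choices of mine, not anything prescribed), the sand-clock shape, tanh activations, and RMSE loss could look like this:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 784  # hypothetical flattened image size

autoencoder = keras.Sequential([
    keras.Input(shape=(input_dim,)),
    # encoder block: each Dense layer has fewer neurons than the previous one
    layers.Dense(256, activation="tanh"),
    layers.Dense(64, activation="tanh"),
    # the "latent space layer", the narrowest point of the sand-clock
    layers.Dense(16, activation="tanh"),
    # decoder block: the mirror image, growing back to the input size
    layers.Dense(64, activation="tanh"),
    layers.Dense(256, activation="tanh"),
    layers.Dense(input_dim, activation="tanh"),
])

# Root Mean Squared Error between the original and the reconstruction
def rmse(y_true, y_pred):
    return tf.sqrt(tf.reduce_mean(tf.square(y_true - y_pred)))

autoencoder.compile(optimizer="adam", loss=rmse)
autoencoder.summary()
```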
A bonus to this is to normalize the input tensors by setting the mean of each feature to zero; this helps the network learn a lot, as I'll explain next.
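A tiny sketch of that preprocessing step (hypothetical NumPy data, nothing specific to any dataset):

```python
import numpy as np

X = np.random.rand(1000, 784).astype("float32")  # stand-in for the training set

# set the mean of each feature (column) to zero; optionally scale to unit variance
X_normalized = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)
```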
Words are cheap, show me the backpropagation
Remember that the values in the hidden layers lie in [-1, 1]? Well, this range, with the support of the weights and a bias (Wx + b), makes it possible to have in each layer a continuous combination of fewer features (values from -1 to 1, considering all the possible values in between).
With backpropagation (driven by the loss function), the idea is to find a sweet spot of weights that turns the training domain (say, black-and-white MNIST digits, RGB cat images, and so on) into a low-dimensional continuous set (really small numbers ranging between -1 and 1) in the encoding layers; then, in the decoding layers, it tries to use the same weights (remember, it's a sand-clock shaped network) to emit the higher representation of the previous [-1, 1] combination.
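Continuing the sketch above (reusing the hypothetical autoencoder model and the zero-centered X_normalized from the earlier snippets), backpropagation is driven simply by asking the network to reproduce its own input:

```python
# the target is the input itself: the loss compares x' against x
autoencoder.fit(X_normalized, X_normalized,
                epochs=20,
                batch_size=128,
                shuffle=True,
                validation_split=0.1)

# after training, the reconstructions should land close to the originals
reconstructions = autoencoder.predict(X_normalized)
```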
An analogy
To put this into a kind of game: two people stand back to back, one looking through a window and the other facing a whiteboard. The first looks outside, sees a flower with all its details, and says "sunflower" (the latent space); the second person hears that and draws a sunflower with all the colors and details that they learned in the past.
A real-world example, please
Continuing with the sunflower analogy, imagine the same case, but now your input image (tensor) has noise (you know, it's glitchy). The autoencoder was trained with high-quality images, so it is able to compress the sunflower concept and then reconstruct it. What happened to the glitch? The network encoded the sunflower's colors, shape, and background (let's say a blue sky), the decoder reconstructed them, and the glitch was left behind as residual. And this is a denoising autoencoder, one of the many applications of this network.
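As a sketch of that denoising variant (again reusing the hypothetical model and data from the snippets above; the noise level is an arbitrary choice), the only change is that the input is corrupted while the target stays clean:

```python
import numpy as np

X_clean = X_normalized  # the high-quality training data from the earlier snippet

# corrupt the inputs with Gaussian noise (the "glitch"), keep the clean targets
noise = np.random.normal(0.0, 0.2, size=X_clean.shape).astype("float32")
X_noisy = X_clean + noise

# the network learns to map noisy inputs to clean reconstructions
autoencoder.fit(X_noisy, X_clean, epochs=20, batch_size=128)

# at inference time a glitchy image goes in, a cleaned-up one comes out
denoised = autoencoder.predict(X_noisy[:5])
```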