neural-network conv-neural-network convolution biological-neural-network

How do deeper layers learn from the previous layers' feature maps in conv nets?

I have read a lot about convnets, but I am still missing an important part.

Let's say we have a conv2D layer with 32 filters:

I understand that the weights of these filters are initialized randomly and that the filters are formed during the training process. So in the first layer they start to detect edges.

Now, after pooling, we have another conv layer (let's say 32 filters again), which will apply its filters to the result of the previous layer.

So layer 2 will apply its 32 filters to ANY of these 32 outputs from the first layer. I have seen many examples of these feature maps: the first layer produces pictures of edges, and on the next layer the pictures are shapes, an ear, a nose and so on. My question is: how is this possible?

If layer 2 applies its filters to the result of layer 1, and the result of layer 1 is edges, then how do you get a shape from an edge?

I am clearly missing something here. Please help me understand how each successive layer in a conv net can produce richer features such as shapes, eyes and faces when it works on the output of the previous layer, where the features are just lines and edges.

Is there some merging of information during the process that I'm missing, or is it something more?

Thanks in advance

Solution

A simple example: let's say you are trying to distinguish simple geometric shapes, e.g. rectangles from diamonds.

On the first layer you have various edge detectors. Some fire when they detect horizontal edges, some when they detect vertical edges and some others when they see diagonal edges.

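The second layer can now combine those inputs into more complicated shapes. Each second-layer filter spans all feature maps of the first layer (its depth equals the number of first-layer filters), so it sees every edge detector's output at once. One filter/detector will therefore fire if both vertical and horizontal edges are detected on the first layer. This is the filter for the rectangle.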

Another filter will fire when the first layer reports that it detected diagonal edges. This is the filter for the diamonds.
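To make the combination step concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the 3x3 activation patterns, the function name conv2d_one_filter, and the hand-picked weights and bias (in a real network these would be learned by backpropagation, not set by hand). It shows a single second-layer filter that sums over all first-layer feature maps and responds only when horizontal and vertical edges occur together.

```python
def conv2d_one_filter(maps, kernel, bias=0.0):
    """Apply ONE filter to a stack of feature maps, then ReLU.

    maps:   list of D input feature maps, each an H x W list of lists
    kernel: list of D weight grids, each F x F (one grid per input map)
    The sum runs over every input map, which is where the merging happens.
    """
    depth = len(maps)
    height, width = len(maps[0]), len(maps[0][0])
    f = len(kernel[0])
    output = []
    for i in range(height - f + 1):
        row = []
        for j in range(width - f + 1):
            total = bias
            for d in range(depth):
                for di in range(f):
                    for dj in range(f):
                        total += maps[d][i + di][j + dj] * kernel[d][di][dj]
            row.append(max(0.0, total))  # ReLU
        output.append(row)
    return output


# Hypothetical first-layer activations on a 3x3 patch.
ZEROS = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
HORIZONTAL = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]  # horizontal edge along the top
VERTICAL = [[0, 0, 0], [1, 0, 0], [1, 0, 0]]    # vertical edge along the left
DIAGONAL = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]    # diagonal edge

# Channel order: [horizontal map, vertical map, diagonal map].
corner_patch = [HORIZONTAL, VERTICAL, ZEROS]     # corner of a rectangle
diamond_patch = [ZEROS, ZEROS, DIAGONAL]         # side of a diamond
line_patch = [HORIZONTAL, ZEROS, ZEROS]          # a lone horizontal edge

# Hand-set "rectangle corner" filter: rewards horizontal and vertical edges
# in the right places, penalizes diagonal ones.
corner_kernel = [HORIZONTAL, VERTICAL, [[-1, -1, -1]] * 3]

rect_response = conv2d_one_filter(corner_patch, corner_kernel, bias=-3.0)
diamond_response = conv2d_one_filter(diamond_patch, corner_kernel, bias=-3.0)
line_response = conv2d_one_filter(line_patch, corner_kernel, bias=-3.0)

print(rect_response, diamond_response, line_response)
```

With these made-up numbers the filter is active only for the corner patch. A single edge type is not enough to overcome the negative bias, which is the sense in which the second layer detects a combination of edges and not just edges again.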

You might want to familiarize yourself with the input and output dimensions of a convolutional layer.

Input: W1 x W1 x D1

Output: W2 x W2 x D2, where
W2 = (W1 - F + 2P)/S + 1
D2 = K

Terminology: K = number of filters, F = spatial size of the filter, P = zero padding, S = stride
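As a quick check of the formula, here is a small sketch that applies it to the two-layer setup from the question. The 28 x 28 x 1 input, the 3 x 3 filter size and the 2 x 2 pooling are assumptions chosen for illustration (the question does not state them), and the helper names are hypothetical. Raising an error when the stride does not divide evenly is one possible choice; many frameworks round down instead.

```python
def conv_output_shape(w1, k, f, p=0, s=1):
    """Return (W2, W2, D2) for a square input of width w1.

    k = number of filters, f = filter size, p = zero padding, s = stride.
    """
    span = w1 - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError("filter does not fit the input evenly with these settings")
    w2 = span // s + 1
    return (w2, w2, k)


def weights_per_filter(f, d1):
    """Each filter covers an f x f window across ALL d1 input channels."""
    return f * f * d1


# Layer 1: assumed 28 x 28 x 1 input, 32 filters of size 3 x 3.
layer1 = conv_output_shape(w1=28, k=32, f=3)

# Assumed 2 x 2 max pooling with stride 2 halves the width.
pooled_width = layer1[0] // 2

# Layer 2: 32 filters of size 3 x 3 applied to the pooled 32-channel volume.
layer2 = conv_output_shape(w1=pooled_width, k=32, f=3)
layer2_filter_weights = weights_per_filter(f=3, d1=layer1[2])

print(layer1, pooled_width, layer2, layer2_filter_weights)
```

Note that each second-layer filter has 3 * 3 * 32 weights here, not 3 * 3: its depth matches D1, the number of feature maps coming out of the first layer.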

You might find these helpful:

https://adeshpande3.github.io/adeshpande3.github.io/A-Beginner's-Guide-To-Understanding-Convolutional-Neural-Networks/

http://cs231n.github.io/convolutional-networks/