machine-learning computer-vision convolution

How a Convolutional Neural Net handles channels


I've looked through a lot of explanations of the way a CNN conventionally handles multiple channels (such as 3 in an RGB image) and am still at a loss.

When a 5x5x3 filter (say) is applied to a patch of an RGB image what exactly happens? Is it in fact 3 different 2D convolutions (with independent weights) that happen separately to each channel? And then the results get simply added together to produce the final output to pass to the next layer? Or is it a truly 3D convolution?


Solution

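  • [Image: a 6 X 6 X 3 input convolved with a 3 X 3 X 3 filter, giving a 4 X 4 output; the filter's current position on the input is highlighted as a yellow cube]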

    This image is from Andrew Ng's deeplearning.ai course. The input is 6 X 6 X 3, where 6 X 6 is the height and width of the image and 3 is the number of color channels. In the convolution step, the input is convolved with a 3 X 3 X 3 filter (kernel). The filter's depth always matches the number of input channels, so the filter has one 3 X 3 slice of weights per channel. With stride 1 and no padding, the output is 4 X 4 X 1 (6 - 3 + 1 = 4 in each spatial dimension).

    The 3 X 3 X 3 filter has 27 weights. To compute the output value at [0,0], place the filter over the top-left 3 X 3 X 3 block of the input (the yellow cube in the image), multiply each weight by the corresponding value in the red, green, and blue channels, and add up all 27 products. This is the same as running 3 separate 2D convolutions with independent weights, one per channel, and summing the results. It is not a true 3D convolution, because the filter spans the full depth and slides only along height and width. Then slide the yellow cube one box to the right; once it reaches the right edge, move it one row down and back to the left, and keep multiplying and summing until the 4 X 4 output is filled. I suggest taking paper and pencil, filling the input and the kernel with random values, and working through the multiplication by hand.

    For more details, watch these lectures on YouTube: https://www.youtube.com/watch?v=KTB_OFoAQcc&index=6&list=PLkDaE6sCZn6Gl29AoE31iwdVwSG-KnDzF

    https://www.youtube.com/watch?v=7g8jpK4llkc&t=1s
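
The paper-and-pencil exercise above can also be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the function names `conv2d_single` and `conv_multichannel` are made up for this example, the data is laid out as `image[channel][row][col]`, the stride is 1 with no padding, and no bias term is added. As in most deep learning frameworks, the kernel is not flipped, so strictly speaking this is a cross-correlation. The sketch computes the 6 X 6 X 3 by 3 X 3 X 3 case directly, then checks that it matches three independent 2D convolutions whose outputs are added together.

```python
import random


def conv2d_single(channel, kernel):
    """Slide one 2D kernel over one 2D channel (stride 1, no padding)."""
    n, k = len(channel), len(kernel)
    out = n - k + 1
    return [[sum(channel[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(out)]
            for i in range(out)]


def conv_multichannel(image, filt):
    """Apply one multi-channel filter; image[c][row][col], filt[c][a][b]."""
    channels = len(image)
    n, k = len(image[0]), len(filt[0])
    out = n - k + 1
    result = [[0] * out for _ in range(out)]
    for i in range(out):          # slide down
        for j in range(out):      # slide right
            total = 0
            # Multiply all channels * k * k weights with the patch and sum them.
            for c in range(channels):
                for a in range(k):
                    for b in range(k):
                        total += image[c][i + a][j + b] * filt[c][a][b]
            result[i][j] = total
    return result


random.seed(0)
# Integer values keep the comparison below exact.
image = [[[random.randint(0, 9) for _ in range(6)] for _ in range(6)]
         for _ in range(3)]
filt = [[[random.randint(-1, 1) for _ in range(3)] for _ in range(3)]
        for _ in range(3)]

direct = conv_multichannel(image, filt)

# Alternative view: one 2D convolution per channel, then add the three maps.
per_channel = [conv2d_single(image[c], filt[c]) for c in range(3)]
summed = [[sum(per_channel[c][i][j] for c in range(3)) for j in range(4)]
          for i in range(4)]

print(len(direct), len(direct[0]))  # 4 4
print(direct == summed)             # True
```

Each additional filter in a layer would repeat this with its own 27 weights and produce one more 4 X 4 output channel.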