How to imagine convolution/pooling on images with 3 color channels

I am a beginner and I understood the MNIST tutorials. Now I want to get something going on the SVHN dataset. In contrast to MNIST, it comes with 3 color channels. I am having a hard time visualizing how convolution and pooling work with the additional dimension of the color channels.

Does anyone have a good way to think about it, or a link for me?

I appreciate all input :)

Solution

This is simpler than it looks: the only difference is in the first convolution.

In grey images, the input shape is [batch_size, W, H, 1], so your first convolution (let's say 3x3) has a filter of shape [3, 3, 1, 32] if you want 32 output channels.
In RGB images, the input shape is [batch_size, W, H, 3], so your first convolution (still 3x3) has a filter of shape [3, 3, 3, 32]. Each of the 32 filters spans all 3 input channels and sums over them, so it still produces a single output channel.

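In both cases, the output shape (with stride 1 and zero padding that preserves the spatial size, i.e. "SAME" padding) is [batch_size, W, H, 32]. Every layer after the first is therefore identical for grey and RGB inputs. Pooling is applied to each channel independently, so it changes only the spatial dimensions and never the number of channels.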
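
To make the shapes concrete, here is a minimal NumPy sketch of the idea, not how any particular framework implements it. The helper names `conv2d_same` and `max_pool_2x2` are made up for illustration, the inputs are random arrays standing in for 32x32 images, and the convolution assumes an odd kernel size, stride 1, and zero padding.

```python
import numpy as np

def conv2d_same(images, filters):
    """Naive stride-1 convolution with zero padding (illustrative helper).

    images:  [batch, h, w, in_channels]
    filters: [kh, kw, in_channels, out_channels], kh and kw assumed odd
    returns: [batch, h, w, out_channels]
    """
    batch, h, w, c_in = images.shape
    kh, kw, f_in, c_out = filters.shape
    assert c_in == f_in, "filter depth must match the number of input channels"
    ph, pw = kh // 2, kw // 2
    # Zero-pad only the two spatial dimensions so the output keeps size h x w.
    padded = np.pad(images, ((0, 0), (ph, ph), (pw, pw), (0, 0)))
    out = np.zeros((batch, h, w, c_out))
    for i in range(h):
        for j in range(w):
            patch = padded[:, i:i + kh, j:j + kw, :]  # [batch, kh, kw, c_in]
            # Multiply and sum over the window AND over all input channels.
            out[:, i, j, :] = np.tensordot(patch, filters, axes=([1, 2, 3], [0, 1, 2]))
    return out

def max_pool_2x2(x):
    """2x2 max pooling with stride 2, applied to each channel separately.

    Assumes both spatial dimensions are even.
    """
    batch, h, w, c = x.shape
    return x.reshape(batch, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))

rng = np.random.default_rng(0)
gray = rng.random((4, 32, 32, 1))    # stand-in for grey images
rgb = rng.random((4, 32, 32, 3))     # stand-in for RGB images
w_gray = rng.random((3, 3, 1, 32))   # 3x3 filters, 1 input channel, 32 outputs
w_rgb = rng.random((3, 3, 3, 32))    # 3x3 filters, 3 input channels, 32 outputs

out_gray = conv2d_same(gray, w_gray)
out_rgb = conv2d_same(rgb, w_rgb)
pooled = max_pool_2x2(out_rgb)

print(out_gray.shape)  # (4, 32, 32, 32)
print(out_rgb.shape)   # (4, 32, 32, 32)
print(pooled.shape)    # (4, 16, 16, 32)
```

One way to picture it: a 3x3 filter on an RGB image is really a small 3x3x3 block of weights that slides over the image, and at each position all 27 products are added into one number. Doing that with 32 different blocks gives the 32 output channels.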