Confusion about the output channels of convolution neural network

I'm confused about the multi-channel scenario in convolution neural network.

Say I have a 10(width) * 5(height) * 6(channels) image, and I feed it into a default 2-D convolution layer with stride=1 and padding=0 and expect the output to be 8(width) * 3(height) * 16(channels).

I know the size of the kernel is 3(width) * 3(height), but I don't know how many kernels are there exactly, and how the are applied to the input data to give the final 16 channels.

Someone can help me please.

Solution

A 2D convolution layer contains one kernel per input channel, per output channel. So in your case, this will be 6*16=96 kernels. For 3x3 kernels, this corresponds to 3*3*96 = 864 parameters.

>>> import torch

>>> conv = torch.nn.Conv2d(6, 16, (3, 3))
>>> torch.numel(conv.weight)
864

For one image, one kernel per input channel is first applied. In your case, this results in 6 features maps, that are summed together (+ a possible bias) to form 1 of the output channel. Then, you repeat this 15 times to form the 15 other output channels.