Tags: python, pytorch, conv-neural-network, dataloader

What should be the input shape for 3D CNN on a sequence of images?


The Conv3d documentation (https://pytorch.org/docs/stable/generated/torch.nn.Conv3d.html#conv3d) states that the expected input shape for a 3D convolution is (N, C_in, D, H, W). Imagine I have a sequence of images that I want to pass to a 3D CNN. Am I right that:

  1. N -> number of sequences (mini batch)
  2. Cin -> number of channels (3 for rgb)
  3. D -> Number of images in a sequence
  4. H -> Height of one image in the sequence
  5. W -> Width of one image in the sequence

The reason I am asking is that when I stack image tensors with a = torch.stack([img1, img2, img3, img4, img5]), a has shape torch.Size([5, 3, 396, 247]). Is it compulsory to reshape the tensor to torch.Size([3, 5, 396, 247]) so that the number of channels comes first, or does the order not matter inside the DataLoader?

Note that the DataLoader automatically adds one more dimension, which corresponds to N.
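To illustrate the situation in the question, here is a minimal sketch (dummy frames with the same (3, 396, 247) shape as in the question) showing that torch.stack with its default dim=0 puts the sequence length first rather than the channel dimension:

```python
import torch

# Five dummy RGB frames, each of shape (C, H, W) = (3, 396, 247)
frames = [torch.randn(3, 396, 247) for _ in range(5)]

a = torch.stack(frames)  # default dim=0 stacks along a new leading axis
print(a.shape)           # torch.Size([5, 3, 396, 247]) -- D comes first, not C
```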


Solution

  • Yes, it matters: you need to ensure the dimensions are ordered correctly (assuming you use the DataLoader's default collate function). One way to do this is to call torch.stack with dim=1 instead of the default dim=0. For example,

    a = torch.stack([img1, img2, img3, img4, img5], dim=1)
    

    results in a having the desired shape torch.Size([3, 5, 396, 247]).
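Putting it all together, here is a hedged end-to-end sketch (the ClipDataset class and all sizes are made up for illustration): each dataset item is stacked into a (C, D, H, W) clip with dim=1, the DataLoader's default collate function adds the batch dimension N, and the resulting (N, C, D, H, W) batch passes through nn.Conv3d without any reshaping:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class ClipDataset(Dataset):
    # Hypothetical dataset: each item is one clip of `depth` RGB frames,
    # stacked along dim=1 so the item shape is (C, D, H, W).
    def __init__(self, n_clips=4, depth=5, h=396, w=247):
        self.clips = [
            torch.stack([torch.randn(3, h, w) for _ in range(depth)], dim=1)
            for _ in range(n_clips)
        ]

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, i):
        return self.clips[i]

# The default collate function stacks items along a new leading N axis.
loader = DataLoader(ClipDataset(), batch_size=2)
batch = next(iter(loader))
print(batch.shape)  # torch.Size([2, 3, 5, 396, 247]) == (N, C, D, H, W)

# The batch feeds straight into Conv3d (in_channels must match C=3).
conv = nn.Conv3d(in_channels=3, out_channels=8, kernel_size=3)
out = conv(batch)
print(out.shape)  # torch.Size([2, 8, 3, 394, 245]) -- kernel_size=3, no padding
```

With kernel_size=3 and no padding, each of D, H, W shrinks by 2, which is a quick sanity check that Conv3d really treated the second axis as channels and the third as depth.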