I am a total newbie to neural networks, using PyTorch to create a VAE model. I've used a bit of TensorFlow before, but I have no idea what "in_channels" and "out_channels" are as arguments to nn.Conv2d/nn.Conv1d.
Disclaimers aside, my model currently takes in a DataLoader with batch size 128, where each input is a 248 by 46 tensor (so each batch is a 128 x 248 x 46 tensor).
My encoder looks like this right now -- I chopped it down so I could focus on where the error was coming from.
class Encoder(nn.Module):
    def __init__(self, latent_dim):
        super(Encoder, self).__init__()
        self.latent_dim = latent_dim
        self.conv1 = nn.Conv2d(in_channels=248, out_channels=46, kernel_size=(9, 9), stride=(5, 1), padding=(5, 4))

    def forward(self, x):
        print(x.size())
        x = F.relu(self.conv1(x))
        return x
The Conv2d layer was meant to reduce the 248 by 46 input into a 50 by 46 tensor. However, I get this error:
RuntimeError: Given groups=1, weight of size [46, 248, 9, 9], expected input[1, 128, 248, 46] to have 248 channels, but got 128 channels instead
...even though I print x.size() and it displays as torch.Size([128, 248, 46]).
I am unsure a) why the error shows that the layer is adding an extra dimension to x, and b) whether I am even understanding channels correctly. Should 46 be the real number of channels? Why doesn't PyTorch simply request my input size as a tuple or something, like in=(248, 46)?
Or c) whether this is an issue with the way I loaded my data into the model. I have a numpy array data of shape (-1, 248, 46) and started training my model as follows:
tensor_data = torch.from_numpy(data)
dataset = TensorDataset(tensor_data, tensor_data)
train_dl = DataLoader(dataset, batch_size=128, shuffle=True)
...
for epoch in range(20):
    for x_train, y_train in train_dl:
        x_train = x_train.to(device).float()
        optimizer.zero_grad()
        x_pred, mu, log_var = vae(x_train)
        bce_loss = train.BCE(y_train, x_pred)
        kl_loss = train.KL(mu, log_var)
        loss = bce_loss + kl_loss
        loss.backward()
        optimizer.step()
Any thoughts appreciated!
In PyTorch, nn.Conv2d assumes the input (mostly image data) is shaped like [B, C_in, H, W], where B is the batch size, C_in is the number of channels, and H and W are the height and width of the image. The output has a similar shape [B, C_out, H_out, W_out]. Here, C_in and C_out are in_channels and out_channels, respectively. (H_out, W_out) is the output image size, which may or may not equal (H, W), depending on the kernel size, the stride, and the padding.
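For example, here is a minimal sketch of that shape convention (the layer parameters are arbitrary, not taken from your model):

import torch
import torch.nn as nn

# a batch of 128 single-channel 248 x 46 "images": [B, C_in, H, W]
x = torch.randn(128, 1, 248, 46)

# in_channels must equal C_in of the input; out_channels is whatever you choose
conv = nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, stride=1, padding=1)

y = conv(x)
print(y.shape)  # torch.Size([128, 8, 248, 46]), i.e. [B, C_out, H_out, W_out]

# each spatial dimension follows:
# H_out = floor((H + 2 * padding - kernel_size) / stride) + 1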
However, it is confusing to apply conv2d to reduce [128, 248, 46] inputs to [128, 50, 46]. Are they image data with height 248 and width 46? If so, you can reshape the inputs to [128, 1, 248, 46] and use in_channels = 1 and out_channels = 1 in conv2d.
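A minimal sketch of that fix, reusing the kernel size, stride, and padding from your encoder (assuming those are the values you want to keep); the unsqueeze adds the channel dimension:

import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    def __init__(self, latent_dim):
        super(Encoder, self).__init__()
        self.latent_dim = latent_dim
        # one input channel, one output channel; the 248 x 46 data lives in H and W
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=(9, 9), stride=(5, 1), padding=(5, 4))

    def forward(self, x):
        x = x.unsqueeze(1)         # [128, 248, 46] -> [128, 1, 248, 46]
        x = F.relu(self.conv1(x))  # [128, 1, 248, 46] -> [128, 1, 50, 46]
        return x

If you later want more feature maps, just increase out_channels; the spatial output size (50, 46) is determined only by the kernel size, stride, and padding.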