I am trying to do regression with a Vision Transformer model, but I cannot replace the final classification layer with a regression layer.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision.models import vit_b_16

class RegressionViT(nn.Module):
    def __init__(self, in_features=224 * 224 * 3, num_classes=1, pretrained=True):
        super(RegressionViT, self).__init__()
        self.vit_b_16 = vit_b_16(pretrained=pretrained)
        # Accessing the actual output feature size from vit_b_16
        self.regressor = nn.Linear(self.vit_b_16.heads[0].in_features, num_classes * batch_size)

    def forward(self, x):
        x = self.vit_b_16(x)
        x = self.regressor(x)
        return x
# Model
model = RegressionViT(num_classes=1)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
criterion = nn.MSELoss() # Use appropriate loss function for regression
optimizer = optim.Adam(model.parameters(), lr=0.0001)
I get this error when I try to initialize and run the model:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (32x1000 and 768x32)
The problem is a mismatch between the regression layer and the output of the vit_b_16 model. What would be the correct way to solve this issue?
If you look into the source code of VisionTransformer, you will notice that self.heads is a sequential container, not a linear layer. By default it contains a single linear layer named head, which is the final classification layer mapping the 768-dimensional token embedding to the 1000 ImageNet classes. Because that head is still in place, self.vit_b_16(x) returns a (batch_size, 1000) tensor, which your separate 768-input regressor cannot consume, hence the shape mismatch. To overwrite this layer, you can do:
heads = self.vit_b_16.heads
heads.head = nn.Linear(heads.head.in_features, num_classes)
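Putting it together, a minimal sketch of the corrected model could look like the following (it keeps the pretrained flag from your snippet, although newer torchvision releases use the weights argument instead, and it drops the batch_size factor, since the batch dimension is handled automatically):

import torch
import torch.nn as nn
from torchvision.models import vit_b_16

class RegressionViT(nn.Module):
    def __init__(self, num_classes=1, pretrained=True):
        super().__init__()
        self.vit_b_16 = vit_b_16(pretrained=pretrained)
        heads = self.vit_b_16.heads
        # Swap the 768 -> 1000 classification head for a 768 -> num_classes regression head
        heads.head = nn.Linear(heads.head.in_features, num_classes)

    def forward(self, x):
        # The backbone now returns (batch_size, num_classes) directly
        return self.vit_b_16(x)

# Quick shape check with a dummy batch (pretrained=False just to avoid downloading weights here)
model = RegressionViT(num_classes=1, pretrained=False)
out = model(torch.randn(4, 3, 224, 224))
print(out.shape)  # torch.Size([4, 1])

Because the regression head now lives inside the backbone, the separate self.regressor layer is no longer needed, and the rest of your training setup (MSELoss, Adam) can stay as it is.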