deep-learning · transfer-learning · dropout

Should dropout be deactivated when training a model with some frozen modules?


I have a deep neural network made of a combination of modules, such as an encoder, a decoder, etc. Before training, I load part of its parameters from a pretrained model, but only for a subset of the modules. For instance, I could load a pretrained encoder. Then I want to freeze the parameters of the pretrained modules so that they are not trained along with the rest. In PyTorch:

for param in submodel.parameters():
    param.requires_grad = False

Now, should I keep applying dropout to these frozen modules during training, or should I deactivate it (see example below)? Why?

class MyModel(nn.Module):
    ...
    def forward(self, x):
        if self.freeze_submodule:
            self.submodule.eval()  # disable dropout while the submodule is frozen
        x = self._forward(x)
        if self.freeze_submodule:
            self.submodule.train()
        return x

Solution

  • Freezing module

    You can freeze the parameters by calling requires_grad_(False) on the module, which is less verbose:

    submodel.requires_grad_(False)
    

    This will freeze all submodel parameters.

    You could also wrap the submodel's forward pass in a torch.no_grad() context manager, though that is indeed less common.
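
    For illustration, here is a minimal sketch of that alternative; submodel and head are hypothetical stand-ins for the frozen and trainable parts:

    import torch
    import torch.nn as nn

    submodel = nn.Linear(16, 8)   # stand-in for the pretrained, frozen part
    head = nn.Linear(8, 2)        # trainable part

    x = torch.randn(4, 16)
    with torch.no_grad():
        features = submodel(x)    # no autograd graph is recorded for the frozen part
    out = head(features)          # gradients will flow only through head
    out.sum().backward()
    print(submodel.weight.grad)          # None: nothing reached the frozen part
    print(head.weight.grad is not None)  # True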

    eval

    Running submodule.eval() puts certain layers, such as BatchNorm and Dropout, into evaluation mode. For Dropout (inverted dropout, actually) you can check how it works in this answer.
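
    A quick sketch of the difference (inverted dropout scales the surviving activations by 1 / (1 - p) in training mode and becomes the identity in eval mode):

    import torch
    import torch.nn as nn

    drop = nn.Dropout(p=0.5)
    x = torch.ones(1, 8)

    drop.train()
    print(drop(x))  # surviving entries are scaled to 2.0, the rest are zeroed

    drop.eval()
    print(drop(x))  # identity: all ones, no zeroing and no scaling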

    Q: should dropout still be applied to frozen parameters?

    No. The frozen weights will be unable to compensate for dropout's effect, which is one of its goals (making the network more robust and spreading the information flow across more paths); they cannot do so because they are untrainable.

    On the other hand, leaving dropout on adds noise and error to the frozen part's output and might force the trainable part of the network to compensate for it, so I'd go for experimenting. If you do disable it, see the sketch below for a cleaner way than toggling modes inside forward().
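
    A minimal sketch of that pattern, assuming the same hypothetical submodule/head layout as above: overriding train() keeps the frozen submodule in eval mode for the whole run instead of switching modes inside forward():

    import torch.nn as nn

    class MyModel(nn.Module):
        def __init__(self, submodule, head, freeze_submodule=True):
            super().__init__()
            self.submodule = submodule
            self.head = head
            self.freeze_submodule = freeze_submodule
            if freeze_submodule:
                self.submodule.requires_grad_(False)

        def train(self, mode=True):
            super().train(mode)
            if self.freeze_submodule:
                self.submodule.eval()  # keep dropout/BatchNorm of the frozen part in eval mode
            return self

        def forward(self, x):
            return self.head(self.submodule(x))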

    Q: is freezing pretrained submodules useful to avoid their weights being messed up by the gradients that result from training the non-pretrained submodules?

    It depends. The fastai community uses smaller learning rates for pretrained modules while still leaving them trainable (see this blog post for an example), which makes intuitive sense: your task's distribution differs somewhat from the one the backbone was pretrained on, so it is reasonable to assume the weights need to be adjusted by some (possibly small) amount as well.
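
    A minimal sketch of that idea using PyTorch parameter groups; the module names and learning rates are illustrative, not from the original post:

    import torch
    import torch.nn as nn

    encoder = nn.Linear(16, 8)  # stand-in for the pretrained backbone
    head = nn.Linear(8, 2)      # freshly initialized part

    optimizer = torch.optim.Adam([
        {"params": encoder.parameters(), "lr": 1e-5},  # small lr for the pretrained backbone
        {"params": head.parameters(), "lr": 1e-3},     # larger lr for the new head
    ])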