In my experiment, I want to first train a low-level model (L) and then reuse it in a higher-level task (H). Normally I would freeze model L while training H. But is it possible not to freeze L completely, but rather to freeze it partially, by some coefficient?
I'm sorry if this sounds mathematically loose, but if we say a non-frozen model is affected by the gradient at a scale of 1.0, and a frozen one at 0.0, I would like to vary this coefficient, so the module is not completely frozen (0.0) but is still partially affected by gradient descent (for example, at 0.1). Importantly, model L should still fully affect the result of H. In other words, it affects the forward result at a scale of 1.0, but during back-propagation it is only affected at a scale of 0.1.
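Schematically, this is the behaviour I'm after (pseudocode; `grad_scale` is the hypothetical operation I don't know how to implement):

```
features = L(x)                            # low-level model, pretrained
features = grad_scale(features, scale=0.1) # forward: identity; backward: grad * 0.1
output = H(features)                       # high-level model sees L's full output
```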
The main idea behind this is for model L to get slightly fine-tuned w.r.t. the high-level task.
I googled the question, and the best I found were these two questions, which I believe contain a hint, but I still can't figure out how to have separate "weights" for the forward and backward passes:
From what I understand, you're trying to specify a different learning rate for different parts of your model. PyTorch optimizers support that directly via per-parameter groups:
```python
import torch.optim as optim

optimizer = optim.SGD([
    {'params': model.base.parameters()},           # uses the default lr of 1e-2
    {'params': model.L.parameters(), 'lr': 1e-3},  # 10x smaller lr for L
], lr=1e-2, momentum=0.9)
```
From there, you can run a training loop as usual.
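If you literally want the forward-1.0 / backward-0.1 behaviour described in the question, rather than a smaller learning rate, you can scale the gradient itself with a custom `torch.autograd.Function`. A minimal sketch (names like `grad_scale` are my own, not a built-in API):

```python
import torch

class _GradScale(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by `scale` in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)  # forward output is unchanged (scale 1.0)

    @staticmethod
    def backward(ctx, grad_output):
        # No gradient w.r.t. `scale` itself, hence the `None`.
        return grad_output * ctx.scale, None

def grad_scale(x, scale):
    return _GradScale.apply(x, scale)

# Insert between L and H: H sees L's full output, but gradients
# flowing back into L are multiplied by 0.1.
# output = H(grad_scale(L(x), 0.1))
```

Note that with plain SGD (with or without momentum) this is equivalent to the smaller per-group learning rate above, since the update is linear in the gradient. With adaptive optimizers like Adam the two differ, because Adam normalizes gradient magnitudes.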