Tags: normalization, huggingface-transformers

Why is unnormalized input added to output in Huggingface T5 model?


In the Hugging Face T5 code (see for instance this), it seems that the input is "never normalized", in the following sense: each component outputs input + component_fct(norm(input)). So the initial network input keeps being added to more and more tensors, each of which is the result of applying the current sub-component to its normalized input.
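
For concreteness, here is a minimal sketch of the pattern I mean; `component` is a placeholder for the real attention/feed-forward module, and the library actually uses its own RMSNorm-style `T5LayerNorm` rather than `nn.LayerNorm`:

```python
import torch
from torch import nn


class SubLayer(nn.Module):
    """Paraphrase of the residual pattern, NOT the actual library code."""

    def __init__(self, d_model: int, dropout: float = 0.1):
        super().__init__()
        self.layer_norm = nn.LayerNorm(d_model)       # stand-in for T5's RMSNorm-style norm
        self.component = nn.Linear(d_model, d_model)  # stand-in for attention / feed-forward
        self.dropout = nn.Dropout(dropout)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        normed = self.layer_norm(hidden_states)   # norm(input)
        out = self.component(normed)              # component_fct(norm(input))
        return hidden_states + self.dropout(out)  # input + component_fct(norm(input))
```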

Intuitively, I feel it would make more sense to have norm(input) + component_fct(norm(input)), so that we add things of the same magnitude.
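
In other words, something like this hypothetical variant of the forward pass above:

```python
class SubLayerNormedResidual(SubLayer):
    """Hypothetical variant: add the normalized input back, not the raw one."""

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        normed = self.layer_norm(hidden_states)   # norm(input)
        out = self.component(normed)              # component_fct(norm(input))
        return normed + self.dropout(out)         # norm(input) + component_fct(norm(input))
```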

Is there a reason for doing it the way it is currently done?


Solution

  • T5 uses residual (skip) connections, where the input to a layer is added to that layer's output. This is done to avoid the vanishing-gradient problem, where the gradients of the loss function become very small as they are backpropagated through many layers, making the network difficult to train effectively.

    This pattern, where the original, unmodified input is added to the output, is characteristic of the pre-LayerNorm variant of the Transformer, which T5 uses. Layer normalization is applied before the self-attention and feed-forward sub-layers, unlike the original Transformer, where it is applied after the residual addition. Consequently, the output of these sub-layers is added to the original, unnormalized input (a sketch contrasting the two orderings follows at the end of this answer).

    The goal of models like T5 isn't necessarily to maintain the same scale or magnitude throughout the network, but to optimize the learning process and final performance.

    This design choice has been found to improve the performance of the model. You can see the authors discuss their architectural decisions in the T5 paper, "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," and the T5 model code in the 🤗 Transformers library reflects these choices.
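
To make the contrast concrete, here is a minimal sketch of the two orderings (placeholder `component` modules and `nn.LayerNorm` instead of the actual 🤗 Transformers code); T5 follows the pre-LN form, while the original Transformer used the post-LN form:

```python
import torch
from torch import nn


class PreLNBlock(nn.Module):
    """Pre-LayerNorm block, as in T5: x + f(norm(x)).

    The skip path carries the raw, unnormalized input, so the identity term
    passes gradients straight back through the addition.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)             # T5 itself uses an RMSNorm-style norm
        self.component = nn.Linear(d_model, d_model)  # stand-in for attention / feed-forward

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.component(self.norm(x))


class PostLNBlock(nn.Module):
    """Post-LayerNorm block, as in the original Transformer: norm(x + f(x)).

    Here the residual sum itself is re-normalized, so the skip path is no
    longer a clean identity from input to output.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.component = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.component(x))


if __name__ == "__main__":
    x = torch.randn(2, 8, 64)
    pre_stack = nn.Sequential(*[PreLNBlock(64) for _ in range(6)])
    post_stack = nn.Sequential(*[PostLNBlock(64) for _ in range(6)])
    # The pre-LN residual stream grows in magnitude with depth (T5 normalizes
    # once more after the full stack); the post-LN output stays normalized.
    print(pre_stack(x).std(), post_stack(x).std())
```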