Tags: deep-learning, pytorch, nvidia, nvidia-apex

Is GradScaler necessary with mixed precision training with PyTorch?


So, going through the AMP: Automatic Mixed Precision Training tutorial for normal networks, I found out that there are two versions, autocast and GradScaler. I just want to know if it's advisable / necessary to use GradScaler with the training, because it is written in the documentation that:

Gradient scaling helps prevent gradients with small magnitudes from flushing to zero (“underflowing”) when training with mixed precision.

import torch

scaler = torch.cuda.amp.GradScaler()
for epoch in range(1):
    for input, target in zip(data, targets):
        # Forward pass under autocast so eligible ops run in float16
        with torch.cuda.amp.autocast():
            output = net(input)
            loss = loss_fn(output, target)

        # Scale the loss, backprop, then step the optimizer via the scaler
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad()

Also, looking at the NVIDIA Apex documentation for PyTorch, they have used it as:

from apex import amp

# Wrap the model and optimizer for mixed precision training
model, optimizer = amp.initialize(model, optimizer)

loss = criterion(…)
# Scale the loss before backprop, analogous to GradScaler.scale()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()

I think this is what GradScaler does too, so I think it is a must. Can someone help me with this query?


Solution

  • Short answer: yes, your model may fail to converge without GradScaler().

    There are three basic problems with using FP16:

    • Weight updates: with half precision, 1 + 0.0001 rounds to 1. autocast() takes care of this one.
    • Vanishing gradients: with half precision, anything smaller than (roughly) 2⁻¹⁴ ≈ 6e-5 flushes to zero, as opposed to single precision, where the threshold is around 2⁻¹²⁶ ≈ 1e-38. GradScaler() takes care of this one.
    • Exploding loss: similar to the above, overflow is also much more likely with half precision (the maximum representable value is about 65504). This is also managed by the autocast() context. A short numeric sketch of all three issues follows below.
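For a concrete feel for these failure modes, here is a minimal, hypothetical sketch (not part of the original answer) that reproduces each one directly with float16 tensors; the scale factor of 2**16 is just an illustrative value, which happens to match GradScaler's default initial scale:

import torch

# Hypothetical demo: the three FP16 problems shown with plain float16 tensors.

# 1) Weight updates: FP16 spacing near 1.0 is ~1e-3, so 1 + 1e-4 rounds back to 1.
w = torch.tensor(1.0, dtype=torch.float16)
print(w + 1e-4)                                            # tensor(1., dtype=torch.float16)

# 2) Vanishing gradients: a tiny gradient underflows to zero in FP16 ...
g = torch.tensor(1e-8, dtype=torch.float16)
print(g)                                                   # tensor(0., dtype=torch.float16)

# ... but survives once the loss (and therefore every gradient) is multiplied
# by a large scale factor before backward(), which is what GradScaler does.
scale = 2.0 ** 16
print(torch.tensor(1e-8 * scale, dtype=torch.float16))     # representable, no longer zero

# 3) Exploding values: FP16 overflows to inf just above its maximum of ~65504.
print(torch.tensor(70000.0, dtype=torch.float16))          # tensor(inf, dtype=torch.float16)

Note that GradScaler also un-scales the gradients before the optimizer step and, if it finds infs or NaNs in them, skips that step and lowers the scale on update(), so occasional overflows do not corrupt the weights.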