
What makes the difference in the grad attribute in the following contexts?


Consider the following two contexts:

with torch.no_grad():
  params = params - learning_rate * params.grad

and

with torch.no_grad():
  params -= learning_rate * params.grad

In the second case .backward() runs smoothly, while in the first case it raises

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn

What is the reason for this, given that x -= a and x = x - a are normally used interchangeably?


Solution

  • Note that x -= a and x = x - a cannot be used interchangeably here: the latter creates a new tensor that is assigned to the variable x, while the former performs an in-place operation on the existing tensor.

    Therefore with

    with torch.no_grad():
      params -= learning_rate * params.grad
    

    everything works fine in your optimization loop, while in

    with torch.no_grad():
      params = params - learning_rate * params.grad
    

    the variable params gets rebound to a new tensor. Since this new tensor was created within a torch.no_grad() context, it has params.requires_grad=False and its .grad is None. Therefore in the next iteration the loss computed from it has no grad_fn, and torch complains with the RuntimeError shown above.
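
Here is a minimal, self-contained sketch (a made-up one-parameter least-squares example, not the code from the question) that shows both behaviours side by side:

import torch

torch.manual_seed(0)

# Toy data: fit y = 2 * x with a single parameter.
x = torch.randn(10)
y = 2.0 * x
params = torch.ones(1, requires_grad=True)

def loss_fn(p):
    return ((p * x - y) ** 2).mean()

# In-place update: params stays the same leaf tensor, requires_grad stays True.
for _ in range(2):
    loss = loss_fn(params)
    loss.backward()
    with torch.no_grad():
        params -= 0.1 * params.grad
        params.grad.zero_()
print(params.requires_grad)  # True

# Out-of-place update: params is rebound to a brand-new tensor that was
# created inside torch.no_grad(), so it does not require grad.
loss = loss_fn(params)
loss.backward()
with torch.no_grad():
    params = params - 0.1 * params.grad
print(params.requires_grad)  # False
print(params.grad)           # None

# The next backward() fails, because the loss no longer depends on any
# tensor that requires grad:
try:
    loss_fn(params).backward()
except RuntimeError as e:
    print(e)  # element 0 of tensors does not require grad and does not have a grad_fn

If you do want the out-of-place form, one common fix is to re-mark the result as a tensor that requires grad, for example params = (params - learning_rate * params.grad).requires_grad_(), or to let a torch.optim optimizer perform the in-place update for you.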