Search code examples
pythonmathematical-optimizationpytorchnewtons-methodautomatic-differentiation

Update step in PyTorch implementation of Newton's method


I'm trying to get some insight into how PyTorch works by implementing Newton's method for solving x = cos(x). Here's a version that works:

x =  Variable(DoubleTensor([1]), requires_grad=True)

for i in range(5):
    y = x - torch.cos(x)
    y.backward()
    x = Variable(x.data - y.data/x.grad.data, requires_grad=True)

print(x.data) # tensor([0.7390851332151607], dtype=torch.float64) (correct)

This code seems inelegant (inefficient?) to me since it's recreating the entire computational graph during each step of the for loop (right?). I tried to avoid this by simply updating the data held by each of the variables instead of recreating them:

x =  Variable(DoubleTensor([1]), requires_grad=True)
y = x - torch.cos(x)
y.backward(retain_graph=True)

for i in range(5):
    x.data = x.data - y.data/x.grad.data
    y.data = x.data - torch.cos(x.data)
    y.backward(retain_graph=True)

print(x.data) # tensor([0.7417889255761136], dtype=torch.float64) (wrong)

Seems like, with DoubleTensors, I'm carrying enough digits of precision to rule out round-off error. So where's the error coming from?

Possibly related: The above snippet breaks without the retain_graph=True flag set at every step if the for loop. The error message I get if I omit it within the loop --- but retain it on line 3 --- is: RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time. This seems like evidence that I'm misunderstanding something...


Solution

  • I think your first version of code is optimal, meaning that it is not creating a computation graph on every run.

    # initial guess
    guess = torch.tensor([1], dtype=torch.float64, requires_grad = True) 
    
    # function to optimize
    def my_func(x): 
        return x - torch.cos(x)
    
    def newton(func, guess, runs=5): 
        for _ in range(runs): 
            # evaluate our function with current value of `guess`
            value = my_func(guess)
            value.backward()
            # update our `guess` based on the gradient
            guess.data -= (value / guess.grad).data
            # zero out current gradient to hold new gradients in next iteration 
            guess.grad.data.zero_() 
        return guess.data # return our final `guess` after 5 updates
    
    # call starts
    result = newton(my_func, guess)
    
    # output of `result`
    tensor([0.7391], dtype=torch.float64)
    

    In each run, our function my_func(), which defines the computation graph, is evaluated with the current guess value. Once that returns the result, we compute the gradient (with value.backward() call). With this gradient, we now update our guess and zero-out our gradients so that it will be afresh for holding the gradients the next time we call value.backward() (i.e. it stops accumulating the gradient; without zeroing out the gradient, it would by default starts accumulating the gradients. But, we want to avoid that behaviour here).