I am implementing a simple feedforward neural network with PyTorch and the loss function does not seem to decrease. Based on some other tests I have done, the problem seems to be in the computation of pred: if I slightly change the network so that it outputs a 2-dimensional vector for each entry and save that directly as pred, everything works perfectly.
Do you see the problem in how pred is defined here? Thanks
import torch
import numpy as np
from torch import nn

dt = 0.1

class Neural_Network(nn.Module):
    def __init__(self):
        super(Neural_Network, self).__init__()
        self.l1 = nn.Linear(2, 300)
        self.nl = nn.Tanh()
        self.l2 = nn.Linear(300, 1)

    def forward(self, X):
        z = self.l1(X)
        z = self.nl(z)
        o = self.l2(z)
        return o

N = 1000
X = torch.rand(N, 2, requires_grad=True)
y = torch.rand(N, 1)

NN = Neural_Network()
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.Adam(NN.parameters(), lr=1e-5)

epochs = 200
for i in range(epochs):  # trains the NN 200 times
    HH = torch.mean(NN(X))
    gradH = torch.autograd.grad(HH, X)[0]
    # build the Hamiltonian vector field (dH/dx2, -dH/dx1) for each point
    XH = torch.cat((gradH[:, 1].unsqueeze(0), -gradH[:, 0].unsqueeze(0)), dim=0).t()
    pred = X + dt * XH
    # Optimize and improve the weights
    loss = criterion(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("Loss: ", loss.detach().numpy())  # sum of squared errors (reduction='sum')
P.S. With these X and y the loss is not expected to go to zero; I have added them here just for simplicity. I will apply this architecture to data points which are expected to satisfy this model. For now I am only interested in seeing the loss decrease.
My aim is to approximate with a neural network the Hamiltonian of a vector field where only some trajectory is known, for example only the updates $x(t) \rightarrow x(t+\Delta t)$ for some choice of points. So the vector X contains the points $x(t)$, while y contains the $x(t+\Delta t)$. My network above approximates in a simple way the Hamiltonian function $H(x)$, and in order to optimize it I need to find the trajectories associated with this Hamiltonian. In particular, XH aims to be the Hamiltonian vector field associated with the approximated Hamiltonian, and the time update pred = X + dt*XH is simply one step of forward Euler.
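To spell out the step (with the sign convention I am assuming): in 2D the Hamiltonian vector field is $X_H(x) = (\partial H/\partial x_2, -\partial H/\partial x_1)$, which is what the concatenation of gradH[:,1] and -gradH[:,0] builds, and one forward Euler step is $x(t+\Delta t) \approx x(t) + \Delta t\, X_H(x(t))$, i.e. pred = X + dt*XH.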
However, my main issue here can be abstracted as: how can I involve the gradient of a network with respect to its inputs in the loss function?
This is probably because the gradient flow graph for NN is destroyed at the gradH step (compare HH.grad_fn with gradH.grad_fn). So your pred tensor (and the subsequent loss) does not contain the necessary gradient flow through the NN network.
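A quick way to see this (a small check, not in the original code), right after gradH is computed:

print(HH.grad_fn)     # something like <MeanBackward0 object at ...>
print(gradH.grad_fn)  # None: autograd.grad() without create_graph=True returns a result detached from the graph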
The loss contains gradient flow for the input X, but not for NN.parameters(). Because the optimizer only takes a step() over those NN.parameters(), the network NN is not being updated, and since X is not being updated either, the loss does not change.
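You can also confirm this directly right after loss.backward() (a minimal check, assuming the setup from the question):

print(X.grad)  # populated: the gradient does reach the input X
for name, p in NN.named_parameters():
    print(name, p.grad)  # None on the first iteration: nothing reaches the parameters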
You can check how the loss sends its gradients backward by inspecting loss.grad_fn after loss.backward(), and here's a neat function (found on Stack Overflow) to do it:
def getBack(var_grad_fn):
    # recursively walk the autograd graph and print every leaf tensor
    # that carries a gradient, together with that gradient
    print(var_grad_fn)
    for n in var_grad_fn.next_functions:
        if n[0]:
            try:
                tensor = getattr(n[0], 'variable')
                print(n[0])
                print('Tensor with grad found:', tensor)
                print(' - gradient:', tensor.grad)
                print()
            except AttributeError as e:
                getBack(n[0])
Call getBack(loss.grad_fn) after loss.backward() to check it for yourself (maybe reduce the batch size N first, though).
Edit: It works after changing the line to gradH = torch.autograd.grad(HH, X, create_graph=True)[0].
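For reference, a sketch of the corrected training step (same setup as in the question, only that one line changed):

for i in range(epochs):
    HH = torch.mean(NN(X))
    # create_graph=True keeps the graph of the gradient computation,
    # so the loss below stays connected to NN.parameters()
    gradH = torch.autograd.grad(HH, X, create_graph=True)[0]
    XH = torch.cat((gradH[:, 1].unsqueeze(0), -gradH[:, 0].unsqueeze(0)), dim=0).t()
    pred = X + dt * XH
    loss = criterion(pred, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()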