I am training a neural network on video frames (converted to greyscale) to output a tensor with two values. The first iteration always produces an acceptable loss (mean squared error, generally between 15 and 40), the second pass shows an exponential rise, and from then on the loss is infinite.
The net is quite vanilla:
class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(100 * 291, 29100),
            nn.ReLU(),
            nn.Linear(29100, 29100),
            nn.ReLU(),
            nn.Linear(29100, 2),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
As is the training loop:
def train(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    model.train()
    for batch, (X, y) in enumerate(dataloader):
        X, y = X.to("cpu"), y.to("cpu")

        # Compute prediction error
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
Example of loss function growth:
ITERATION 1
prediction: tensor([[-1.2239, -8.2337]], grad_fn=<AddmmBackward>)
actual: tensor([[0.0321, 0.0325]])
loss: tensor(34.9545, grad_fn=<MseLossBackward>)
ITERATION 2
prediction: tensor([[ 314636.5625, 2063098.2500]], grad_fn=<AddmmBackward>)
actual: tensor([[0.0330, 0.0323]])
loss: tensor(2.1777e+12, grad_fn=<MseLossBackward>)
ITERATION 3
prediction: tensor([[-8.0924e+22, -5.3062e+23]], grad_fn=<AddmmBackward>)
actual: tensor([[0.0334, 0.0317]])
loss: tensor(inf, grad_fn=<MseLossBackward>)
Here is an example of the video data: each frame is a 100x291 greyscale image, and there are 1100 of them in the training dataset:
dataset.video_frames.size()
> torch.Size([1100, 100, 291])
dataset.video_frames[0]
> tensor([[21., 29., 28., ..., 33., 27., 26.],
[22., 27., 25., ..., 25., 25., 30.],
[23., 26., 26., ..., 24., 24., 28.],
...,
[24., 33., 31., ..., 41., 40., 42.],
[26., 34., 31., ..., 26., 20., 22.],
[25., 32., 32., ..., 21., 20., 18.]])
And the labeled training data:
dataset.y.size()
> torch.Size([1100, 2])
dataset.y[0]
> tensor([0.0335, 0.0315], dtype=torch.float)
I've fiddled with the learning rate and the number of hidden layers, but nothing keeps the loss from going to infinity.
Properly scaling the inputs is crucial for training. Weight initialization schemes make assumptions about how the inputs are scaled, and when those assumptions are violated, training can diverge exactly as you describe: with raw inputs in the 0-255 range feeding layers this wide, the very first gradient step can blow the weights up. See this part of a lecture on weight initialization to see how critical it is for convergence.
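For concreteness, here is a minimal sketch of what that scaling could look like, assuming your raw greyscale values lie in [0, 255] as the sample frame suggests; either rescaling to [0, 1] or standardizing is common:

# Minimal sketch: bring raw greyscale values from [0, 255] into [0, 1]
# before they reach the network (assumes dataset.video_frames holds
# the raw values shown above)
dataset.video_frames = dataset.video_frames / 255.0

# Alternatively, standardize to zero mean and unit variance:
# mean = dataset.video_frames.mean()
# std = dataset.video_frames.std()
# dataset.video_frames = (dataset.video_frames - mean) / std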
More details on the mathematical analysis of the influence of weight initialization can be found in Sec. 2 of this paper:
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification," ICCV 2015.
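If you also want to set the initialization explicitly, a sketch using PyTorch's built-in He/Kaiming initializer (the init_weights helper here is illustrative, not from your code) might look like:

import torch.nn as nn

def init_weights(m):
    # He/Kaiming initialization, derived for ReLU activations
    if isinstance(m, nn.Linear):
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        nn.init.zeros_(m.bias)

model = NeuralNetwork()
model.apply(init_weights)  # applied recursively to every submodule

Note that nn.Linear already uses a Kaiming-style default; the point is that its assumptions only hold once the inputs are sensibly scaled.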