Tags: pytorch, loss-function

What should I think about when writing a custom loss function?


I'm trying to get my toy network to learn a sine wave.

I output (via tanh) a number between -1 and 1, and I want the network to minimise the following loss, where self(x) are the predictions.

loss = -torch.mean(self(x)*y)

This should be equivalent to trading a stock with a sinusoidal price, where self(x) is our desired position, and y are the returns of the next time step.

The issue I'm having is that the network doesn't learn anything. It does work if I change the loss function to be torch.mean((self(x)-y)**2) (MSE), but this isn't what I want. I'm trying to focus the network on 'making a profit', not making a prediction.

I think the issue may be related to the convexity of the loss function, but I'm not sure, and I'm not certain how to proceed. I've experimented with differing learning rates, but alas nothing works.

What should I be thinking about?

Actual code:

%load_ext tensorboard
import matplotlib.pyplot as plt; plt.rcParams["figure.figsize"] = (30,8)
import torch;from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F;import pytorch_lightning as pl
from torch import nn, tensor
def piecewise(x): return 2*(x>0)-1

class TsDs(torch.utils.data.Dataset):
  def __init__(self, s, l=5): super().__init__();self.l,self.s=l,s
  def __len__(self): return self.s.shape[0] - 1 - self.l
  def __getitem__(self, i): return self.s[i:i+self.l], torch.log(self.s[i+self.l+1]/self.s[i+self.l])
  def plt(self): plt.plot(self.s)

class TsDm(pl.LightningDataModule):
  def __init__(self, length=5000, batch_size=1000): super().__init__();self.batch_size=batch_size;self.s = torch.sin(torch.arange(length)*0.2) + 5 + 0*torch.rand(length)
  def train_dataloader(self): return DataLoader(TsDs(self.s[:3999]), batch_size=self.batch_size, shuffle=True)
  def val_dataloader(self): return DataLoader(TsDs(self.s[4000:]), batch_size=self.batch_size)

dm = TsDm()

class MyModel(pl.LightningModule):
    def __init__(self, learning_rate=0.01):
        super().__init__();self.learning_rate = learning_rate
        self.conv1 = nn.Conv1d(1,5,2)
        self.lin1 = nn.Linear(20,3);self.lin2 = nn.Linear(3,1)
        # self.network = nn.Sequential(nn.Conv1d(1,5,2),nn.ReLU(),nn.Linear(20,3),nn.ReLU(),nn.Linear(3,1), nn.Tanh())
        # self.network = nn.Sequential(nn.Linear(5,5),nn.ReLU(),nn.Linear(5,3),nn.ReLU(),nn.Linear(3,1), nn.Tanh())
    def forward(self, x):
        out = x.unsqueeze(1)       # add channel dim: (batch, 1, length)
        out = self.conv1(out)
        out = out.reshape(-1, 20)
        out = F.relu(out)
        out = self.lin1(out)
        out = F.relu(out)
        out = self.lin2(out)
        return torch.tanh(out)

    def step(self, batch, batch_idx, stage):
        x, y = batch
        loss = -torch.mean(self(x)*y)
        # loss = torch.mean((self(x)-y)**2)
        print(loss)
        self.log("loss", loss, prog_bar=True)
        return loss
    def training_step(self, batch, batch_idx): return self.step(batch, batch_idx, "train")
    def validation_step(self, batch, batch_idx): return self.step(batch, batch_idx, "val")
    def configure_optimizers(self): return torch.optim.SGD(self.parameters(), lr=self.learning_rate)

#logger = pl.loggers.TensorBoardLogger(save_dir="/content/")
mm = MyModel(0.1);trainer = pl.Trainer(max_epochs=10)
# trainer.tune(mm, dm)
trainer.fit(mm, datamodule=dm)

Solution

  • If I understand you correctly, I think that you were trying to maximize the unnormalized correlation between the network's prediction, self(x), and the target value y.

    As you mention, the problem is the convexity of the loss w.r.t. the model weights. One way to see the problem is to consider that the model is a simple linear predictor w'*x, where w is the model weights, w' its transpose, and x the input feature vector (assume a scalar prediction for now). Then, if you look at the derivative of the loss w.r.t. the weight vector (i.e., the gradient), you'll find that it no longer depends on w!
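    A quick toy check (not part of the original answer, just an illustration with random data) confirms this: for a linear model w'x under loss = -mean((w'x)*y), autograd returns the same gradient no matter what w is.

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 5)
y = torch.randn(100)

def grad_at(w):
    # gradient of loss = -mean((w'x) * y) evaluated at this w
    w = w.clone().requires_grad_(True)
    loss = -torch.mean((x @ w) * y)
    loss.backward()
    return w.grad

g1 = grad_at(torch.zeros(5))
g2 = grad_at(torch.randn(5) * 10)
print(torch.allclose(g1, g2))  # True: gradient is -mean(x*y), independent of w
```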

    One way to fix this is to change the loss to:

    loss = -torch.mean(torch.square(self(x)*y))
    

    or

    loss = -torch.mean(torch.abs(self(x)*y))
    
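    Repeating the toy check above with the squared variant (again, just an illustrative sketch on random data) shows the gradient now varies with w, so the optimizer has something to follow:

```python
import torch

torch.manual_seed(0)
x = torch.randn(100, 5)
y = torch.randn(100)

def grad_at(w):
    # gradient of loss = -mean((w'x * y)^2) evaluated at this w
    w = w.clone().requires_grad_(True)
    loss = -torch.mean(torch.square((x @ w) * y))
    loss.backward()
    return w.grad

g1 = grad_at(torch.ones(5))
g2 = grad_at(2 * torch.ones(5))
print(torch.allclose(g1, g2))  # False: the gradient now depends on w
```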

    You will have another big problem, however: these loss functions encourage unbounded growth of the model weights. In the linear case, one solves this with a Lagrangian relaxation of a hard constraint on, for example, the norm of the model weight vector. I'm not sure how this would be done with neural networks, as each layer would need its own Lagrangian parameter...
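    One pragmatic approximation (my sketch, not a full per-layer Lagrangian treatment) is a single soft penalty on the total squared weight norm, with a hypothetical coefficient `lam` that would need tuning:

```python
import torch
from torch import nn

lam = 1e-3  # hypothetical penalty strength; tune for your problem

def penalized_loss(model, x, y):
    # profit-style term plus a soft constraint on the weight norm
    profit = -torch.mean(torch.square(model(x) * y))
    reg = sum(p.pow(2).sum() for p in model.parameters())
    return profit + lam * reg

model = nn.Sequential(nn.Linear(5, 3), nn.ReLU(), nn.Linear(3, 1), nn.Tanh())
x, y = torch.randn(8, 5), torch.randn(8, 1)
loss = penalized_loss(model, x, y)
loss.backward()  # gradients now include the shrinkage term
```

    In practice the same L2 shrinkage can be had more simply via the optimizer's `weight_decay` argument, e.g. `torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-3)`.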