Tags: pytorch, pytorch-lightning

PyTorch Lightning CSVLogger: Why are training and validation losses on different lines?


I have a question about PyTorch Lightning's CSVLogger that has been bugging me for a couple of weeks now.

When I log the training and validation losses in their respective training_step and validation_step methods, the resulting metrics.csv records the two metrics on separate rows. The file looks like this:

epoch  train_loss  val_loss
0      0.01        null
0      null        0.02
1      0.005       null
1      null        0.01
2      0.01        null
2      null        0.02

It also contains a step column that I've omitted here; the step value is the same for both rows of a given epoch.

Is there any way to get these onto a single row of the CSV using the built-in CSVLogger? I couldn't find anything about this online or in the documentation.


The following code reproduces the problem described above:


import torch
from torch.nn import functional as F
from torch.utils.data import TensorDataset
import lightning as pl
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris dataset and hold out 20% of it for validation
iris = load_iris()
features, target = iris.data, iris.target

train_features, val_features, train_target, val_target = train_test_split(
    features, target, test_size=0.2
)

# cross_entropy expects float inputs and long (class-index) targets
train_features = torch.tensor(train_features).float()
val_features = torch.tensor(val_features).float()
train_target = torch.tensor(train_target).long()
val_target = torch.tensor(val_target).long()

# Build a DataModule directly from the TensorDatasets
dm = pl.LightningDataModule.from_datasets(
    train_dataset=TensorDataset(train_features, train_target),
    val_dataset=TensorDataset(val_features, val_target),
    batch_size=5,
)

class Model(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 3)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("train_loss", loss, prog_bar=True, on_step=False, on_epoch=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.layer(x)
        loss = F.cross_entropy(y_hat, y)
        self.log("val_loss", loss, prog_bar=True, on_step=False, on_epoch=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

    def forward(self, x):
        return self.layer(x)

model = Model()
trainer = pl.Trainer(max_epochs=10)
trainer.fit(model, dm)
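
Note: the Trainer above uses its default logger. To make sure a metrics.csv is actually written, a CSVLogger can be passed in explicitly; a minimal sketch, where "logs" is just an example save directory:

from lightning.pytorch.loggers import CSVLogger

# metrics.csv ends up under logs/lightning_logs/version_<n>/
trainer = pl.Trainer(max_epochs=10, logger=CSVLogger("logs"))
trainer.fit(model, dm)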

PyTorch Lightning version: 2.2.2


Solution

  • As a workaround, when I'm post-processing my logs I just do the following:

    import pandas as pd
    
    log = pd.read_csv("path_to_log_file.csv")
    log = log.groupby('epoch').mean()  # merge the train and val rows for each epoch
    log['epoch'] = log.index           # because 'epoch' gets turned into the index
    log.index.name = ''                # remove the name 'epoch' from the index
    

    Works fine for me in pandas v1.4.2 (not sure about other versions), since .mean() skips NaNs by default, so the single non-null value in each column survives the merge.
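
    Equivalently, as_index=False keeps epoch as a regular column and skips the index juggling, and groupby(...).first() takes the first non-null value per epoch instead of averaging, which is safer if per-step rows ever end up in the file. A sketch assuming the same metrics.csv layout:

    import pandas as pd
    
    # as_index=False keeps 'epoch' as a column, so no index cleanup is needed
    log = pd.read_csv("path_to_log_file.csv").groupby('epoch', as_index=False).mean()
    
    # or: take the first non-null value per epoch instead of the mean
    log = pd.read_csv("path_to_log_file.csv").groupby('epoch', as_index=False).first()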