I am going through the Deep Markov Model tutorial, where they learn a polyphonic music dataset. The link to the tutorial is:
https://pyro.ai/examples/dmm.html
This model parameterises the transitions and emissions with neural networks, and for the variational inference part they use an RNN to map the observations `x` to the latent space. To check that the model is learning something, they maximise the ELBO (equivalently, minimise the negative ELBO). They refer to the negative ELBO as the NLL. So far I understand what they are doing. However, the next step confuses me: once they have the NLL, they divide it by the sum of the sequence lengths.
times = [time.time()]
for epoch in range(args.num_epochs):
    # accumulator for our estimate of the negative log likelihood
    # (or rather -elbo) for this epoch
    epoch_nll = 0.0
    # prepare mini-batch subsampling indices for this epoch
    shuffled_indices = np.arange(N_train_data)
    np.random.shuffle(shuffled_indices)
    # process each mini-batch; this is where we take gradient steps
    for which_mini_batch in range(N_mini_batches):
        epoch_nll += process_minibatch(epoch, which_mini_batch, shuffled_indices)
    # report training diagnostics
    times.append(time.time())
    epoch_time = times[-1] - times[-2]
    log("[training epoch %04d] %.4f \t\t\t\t(dt = %.3f sec)" %
        (epoch, epoch_nll / N_train_time_slices, epoch_time))
And I don't quite understand why they are doing that. Can someone explain? Are they averaging here? Any insights would be appreciated.
In the tutorial, the optimisation process is trying to reduce the loss, and at the end they want to compare it with reference [1] in the tutorial:
"Finally we report some diagnostic info. Note that we normalize the loss by the total number of time slices in the training set (this allows us to compare to reference [1])."
This quote is from the tutorial you linked.
Basically, the loss is accumulated over all the mini-batches, and then normalised by the total number of time slices so that the final number is an average loss per time slice over the whole training set. So yes, it is an average: dividing the summed negative ELBO by the total sequence length makes the reported figure independent of how many sequences there are and how long they happen to be, which is what makes it directly comparable to the numbers in reference [1].
When you run the code, you can see this normalised loss after every epoch in the diagnostic report generated by the logging.
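To make the division concrete, here is a toy calculation; the sequence lengths and the epoch loss below are made-up numbers for illustration, not values from the actual dataset:

```python
import numpy as np

# Hypothetical toy training set: three sequences of different lengths.
sequence_lengths = np.array([4, 7, 5])

# N_train_time_slices is the total number of time slices across all
# sequences -- the quantity the tutorial divides by.
N_train_time_slices = int(sequence_lengths.sum())  # 16

# Suppose the accumulated epoch loss (sum of the minibatch -ELBOs,
# a made-up number here) is:
epoch_nll = 120.0

# Normalising gives the average negative ELBO *per time slice*, which
# does not grow with dataset size or sequence length.
nll_per_time_slice = epoch_nll / N_train_time_slices
print(nll_per_time_slice)  # 7.5
```

Without this normalisation, the raw `epoch_nll` would double if you simply doubled the training set, so it would say more about the amount of data than about model quality.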