I'm working through this notebook -- https://github.com/aamini/introtodeeplearning/blob/master/lab1/solutions/Part2_Music_Generation_Solution.ipynb -- where we are using an embedding layer, LSTM, and final dense layer w/ softmax to generate music.
I'm a little confused, however, about how we're calculating loss; it is my understanding that in this notebook (in compute_loss()), in any given batch, we are comparing expected labels (which are the notes themselves) to the logits (i.e. predictions from the dense layer). However, aren't these predictions supposed to be a probability distribution? When are we actually selecting the label that we are predicting against?
A little more clarification on my question: if the shape of our labels is (batch_size, # of time steps), and the shape of our logits is (batch_size, # of time steps, vocab_size), at what point in the compute_loss() function are we actually selecting a label for each time step?
The short answer is that the Keras loss function sparse_categorical_crossentropy() does everything you need.
At each timestep, the LSTM model's top dense layer produces one raw score (a logit) per entry in the model's vocabulary, which in this case is a set of musical notes. A softmax then turns those scores into a probability distribution over the vocabulary; in this notebook the softmax is applied inside the loss function itself, since the loss is called on the raw logits. Suppose the vocabulary comprises the notes A, B, C, D. Then one possible probability distribution is [0.01, 0.70, 0.28, 0.01], meaning that the model is putting a lot of probability on note B (index 1), like so:
Label:    A     B     C     D
Index:    0     1     2     3
Prob:   0.01  0.70  0.28  0.01
Suppose the true note should be C, which is represented by the integer 2, since it sits at index 2 in the distribution array (with indexing starting at 0). To measure the difference between the predicted distribution and that true label, the sparse_categorical_crossentropy() function produces a single floating-point number: the loss.
More information can be found on the TensorFlow documentation page for tf.keras.losses.sparse_categorical_crossentropy. On that page, they have the example:
y_true = [1, 2]
y_pred = [[0.05, 0.95, 0], [0.1, 0.8, 0.1]]
loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
You can see in that example there is a batch of two instances. For the first instance, the true label is 1 and the predicted distribution is [0.05, 0.95, 0], and for the second instance, the true label is 2 while the predicted distribution is [0.1, 0.8, 0.1].
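Running that snippet gives a loss of roughly 0.0513 for the first instance and 2.3026 for the second, i.e. -log(0.95) and -log(0.1): the penalty is small when the model puts high probability on the true label, and large when it does not.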
This function is used in your Jupyter Notebook in section 2.5:
To train our model on this classification task, we can use a form of the crossentropy loss (negative log likelihood loss). Specifically, we will use the sparse_categorical_crossentropy loss, as it utilizes integer targets for categorical classification tasks. We will want to compute the loss using the true targets -- the labels -- and the predicted targets -- the logits.
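Putting that together, the notebook's compute_loss() amounts to something like the sketch below. The from_logits=True argument tells the loss function that the dense layer's raw outputs have not yet been through a softmax, so it applies one internally:

def compute_loss(labels, logits):
    # labels: (batch_size, # of time steps), integer note indices
    # logits: (batch_size, # of time steps, vocab_size), raw dense-layer outputs
    loss = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
    return loss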
So to answer your questions directly:
it is my understanding that in this notebook (in compute_loss()), in any given batch, we are comparing expected labels (which are the notes themselves) to the logits (i.e. predictions from the dense layer).
Yes, your understanding is correct.
However, aren't these predictions supposed to be a probability distribution?
Yes, they are.
When are we actually selecting the label that we are predicting against?
It is done inside the sparse_categorical_crossentropy() function. If your distribution is [0.05, 0.95, 0], that implicitly means the model is predicting probability 0.05 for index 0, 0.95 for index 1, and 0.0 for index 2. The function never has to pick a single winning label; it simply looks up the predicted probability at the index given by the true integer label and takes its negative log.
A little more clarification on my question: if the shape of our labels is (batch_size, # of time steps), and the shape of our logits is (batch_size, # of time steps, vocab_size), at what point in the compute_loss() function are we actually selecting a label for each time step?
It's inside that function. sparse_categorical_crossentropy() broadcasts over the leading dimensions: for every (sequence, time step) pair it uses the integer label to index into the vocab_size axis of the logits, and it returns one loss value per pair, so the output has shape (batch_size, # of time steps). No explicit selection is needed in compute_loss() itself.
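You can verify the shape handling with random data; the sizes below are made up for illustration and are not taken from the notebook:

import tensorflow as tf

batch_size, time_steps, vocab_size = 4, 10, 83  # illustrative sizes only
labels = tf.random.uniform((batch_size, time_steps), maxval=vocab_size, dtype=tf.int32)
logits = tf.random.normal((batch_size, time_steps, vocab_size))

loss = tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)
print(loss.shape)  # (4, 10): one scalar loss per (sequence, time step) pair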