Tags: pytorch, lstm, recurrent-neural-network

LSTM for predicting characters: cell state and hidden state in the training loop


My goal is to build a model that predicts the next character. I have built a model, and here is my training loop:

import torch.nn as nn
import torch.optim as optim

# Model, dataset and dataloader are defined earlier (not shown)
model = Model(input_size=30, hidden_size=256, output_size=len(dataset.vocab))

EPOCH = 10
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)
init_states = None

for epoch in range(EPOCH):
    loss_overall = 0.0
    
    for i, (inputs,targets) in enumerate(dataloader):
        optimizer.zero_grad()
        
        pred = model.forward(inputs) 
        loss = criterion(pred, targets)
        loss.backward()
        optimizer.step() 

As you can see, I return only the predictions of the model, but not the cell state and hidden state. So the alternative is: pred, cell_state, hidden_state = model.forward(inputs)

My question is: should I do this for the character-prediction task? Why/why not? And in general: when should I return my hidden and cell state?


Solution

  • To understand hidden states, here's an excellent diagram by @nnnmmm from this other StackOverflow post.

    [diagram by @nnnmmm: LSTM hidden states across layers and timesteps]

    The hidden states are (h_n, c_n), i.e. the hidden and cell states at the last timestep, one for every layer. Notice how you can't access the states at earlier timesteps (t < seq_len) from them, only the final ones. Retrieving those final hidden states is useful if you need to access the hidden states of a bigger RNN made up of multiple hidden layers. However, usually you would just use a single nn.LSTM module and set its num_layers to the desired value.

    You don't need to use the hidden states. If you want to read more about them, see this thread from the PyTorch forum.
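
    If you do want to carry the states across consecutive calls, e.g. when a long sequence is split into chunks, the usual pattern looks roughly like the sketch below; the module and tensors here are made up for illustration, not taken from your model:

    import torch
    import torch.nn as nn

    rnn = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

    chunk1 = torch.randn(3, 5, 10)  # hypothetical first chunk: (batch, seq_len, input_size)
    chunk2 = torch.randn(3, 5, 10)  # hypothetical continuation of the same sequences

    out1, (h, c) = rnn(chunk1)                            # initial states default to zeros
    out2, (h, c) = rnn(chunk2, (h.detach(), c.detach()))  # resume from chunk1's final states

    The detach() calls keep gradients from flowing back into the previous chunk, which is the usual truncated-backpropagation setup.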


    Back to your other question, let's take this model as an example:

    rnn = nn.LSTM(input_size=10, hidden_size=256, num_layers=2, batch_first=True)
    

    This means an input sequence has seq_len elements, each of size input_size. With the batch on the first dimension (batch_first=True), its shape is (batch, seq_len, input_size).

    out, (h, c) = rnn(x)
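
    To make the shapes concrete, here is a minimal sketch (the batch of 3 sequences of length 5 is made up for illustration):

    import torch
    import torch.nn as nn

    rnn = nn.LSTM(input_size=10, hidden_size=256, num_layers=2, batch_first=True)

    x = torch.randn(3, 5, 10)   # hypothetical batch: (batch=3, seq_len=5, input_size=10)
    out, (h, c) = rnn(x)

    print(out.shape)  # torch.Size([3, 5, 256]) -> top-layer hidden state at every timestep
    print(h.shape)    # torch.Size([2, 3, 256]) -> last-timestep hidden state for each layer
    print(c.shape)    # torch.Size([2, 3, 256]) -> last-timestep cell state for each layer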
    

    If you are looking to build a character-prediction model, I see two options.

    • You could evaluate a loss at every timestep. Consider an input sequence x, its target y, and the RNN output out. Then for every timestep t you compute loss(out[t], y[t]), and the total loss on this input sequence is the average over all timesteps (both options are sketched after this list).

    • Else, just consider the prediction at the last timestep and compute the loss: loss(out[-1], y), where y is the target containing only the seq_len+1-th character of the sequence.

    If you're using nn.CrossEntropyLoss, both approaches will only require a single function call, as explained in your last thread.
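
    To illustrate both options with a single nn.CrossEntropyLoss call, here is a hedged sketch; the linear head, vocabulary size of 65, and random tensors are assumptions for the example, not part of your model:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()
    vocab_size = 65                    # hypothetical vocabulary size
    rnn = nn.LSTM(input_size=10, hidden_size=256, num_layers=2, batch_first=True)
    head = nn.Linear(256, vocab_size)  # hypothetical projection to character logits

    x = torch.randn(3, 5, 10)          # (batch, seq_len, input_size)
    out, (h, c) = rnn(x)
    logits = head(out)                 # (batch, seq_len, vocab_size)

    # Option 1: loss at every timestep. CrossEntropyLoss expects the class
    # dimension at position 1, so permute (batch, seq_len, vocab) -> (batch, vocab, seq_len).
    y_all = torch.randint(0, vocab_size, (3, 5))          # target character at each timestep
    loss_all = criterion(logits.permute(0, 2, 1), y_all)  # averaged over batch and time

    # Option 2: loss only at the last timestep
    y_last = torch.randint(0, vocab_size, (3,))           # the seq_len+1-th character
    loss_last = criterion(logits[:, -1, :], y_last)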