Hello dear community,
I am training a Seq2Seq model to generate a question based on a graph. Both train and val loss are converging, but the generated questions (on either the train or the test set) are nonsense and consist mostly of repeated tokens. I tried various hyperparameters and double-checked the input and output tensors.
Something that I do find odd is that the output out (see below) starts containing some values that I consider unusually high. This starts happening around halfway through the first epoch:
Out: tensor([[ 0.2016, 103.7198, 90.4739, ..., 0.9419, 0.4810, -0.2869]]
My guess is vanishing/exploding gradients, which I thought I had handled with gradient clipping, but now I am not sure about this:
for p in model_params:
    p.register_hook(lambda grad: torch.clamp(
        grad, -clip_value, clip_value))
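For comparison, here is a minimal sketch of clipping with PyTorch's built-in utility instead of per-parameter hooks. The tiny nn.Linear model, the random data and max_norm=1.0 are placeholders chosen for illustration, not the setup from this post. clip_grad_norm_ rescales all gradients together so that their combined norm stays below a threshold, and torch.nn.utils.clip_grad_value_ is the element-wise counterpart of the clamp hook. Either way, clipping only bounds the gradients: weights and logits can still grow over many update steps, so large logits are not by themselves proof that clipping failed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 3)  # placeholder model, not the one from the post
optimizer = torch.optim.SGD(model.parameters(), lr=0.065)

x = torch.randn(8, 4) * 100.0  # large inputs to provoke large gradients
target = torch.randint(0, 3, (8,))

optimizer.zero_grad()
loss = nn.functional.cross_entropy(model(x), target)
loss.backward()

# Rescale all gradients so that their combined L2 norm is at most max_norm.
# The return value is the norm measured before clipping.
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Recompute the combined norm to confirm the effect of clipping.
norm_after = torch.sqrt(sum((p.grad ** 2).sum() for p in model.parameters()))
print(f"grad norm before: {norm_before.item():.3f}, after: {norm_after.item():.3f}")

optimizer.step()
```

Logging the norm returned by clip_grad_norm_ at every step is a cheap way to see whether gradients actually spike around the point where the logits start to grow.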
Below are the training curves (10K samples, batch size=128, lr=0.065, lr_decay=0.99, dropout=0.25)
Encoder (a GNN that learns node embeddings of the input graph, which consists of around 3-4 nodes and edges. A single graph embedding is obtained by pooling the node embeddings and is fed to the Decoder as its initial hidden state):
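import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import NNConv

class QuestionGraphGNN(torch.nn.Module):
    # The parameter list was cut off in the post; the names below are the ones
    # the body uses, and their order is an assumption.
    def __init__(self, in_channels, hidden_channels, out_channels, dropout, aggr):
        super(QuestionGraphGNN, self).__init__()
        # Edge network: maps each edge feature vector to the flattened
        # (in_channels x hidden_channels) weight matrix that NNConv expects.
        nn1 = torch.nn.Sequential(
            torch.nn.Linear(in_channels, hidden_channels),
            torch.nn.Linear(hidden_channels, in_channels * hidden_channels))
        self.conv = NNConv(in_channels, hidden_channels, nn1, aggr=aggr)
        self.lin = nn.Linear(hidden_channels, out_channels)
        self.dropout = dropout

    def forward(self, x, edge_index, edge_attr):
        x = self.conv(x, edge_index, edge_attr)
        x = F.leaky_relu(x)
        x = F.dropout(x, p=self.dropout)
        x = self.lin(x)
        return x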
Decoder (the out vector from above is printed in the forward() function):
class DecoderRNN(nn.Module):
    # The parameter list was cut off in the post; the names below are the ones
    # the body uses, and their order is an assumption.
    def __init__(self, output_size, embedding_size, dropout):
        super(DecoderRNN, self).__init__()
        self.output_size = output_size
        self.dropout = dropout
        self.embedding = nn.Embedding(output_size, embedding_size)
        self.gru1 = nn.GRU(embedding_size, embedding_size)
        self.gru2 = nn.GRU(embedding_size, embedding_size)
        self.gru3 = nn.GRU(embedding_size, embedding_size)
        self.out = nn.Linear(embedding_size, output_size)
        self.logsoftmax = nn.LogSoftmax(dim=1)

    def forward(self, inp, hidden):
        # One token at a time: shape (seq_len=1, batch=1, embedding_size)
        output = self.embedding(inp).view(1, 1, -1)
        output = F.leaky_relu(output)
        output = F.dropout(output, p=self.dropout)
        output, hidden = self.gru1(output, hidden)
        output = F.dropout(output, p=self.dropout)
        output, hidden = self.gru2(output, hidden)
        output, hidden = self.gru3(output, hidden)
        out = self.out(output[0])
        print("Out: ", out)
        output = self.logsoftmax(out)
        return output, hidden
I am using PyTorch's NLLLoss(). The optimizer is SGD. I call optimizer.zero_grad() right before the backward pass and the optimizer step, and I switch between training and evaluation mode for training, evaluation and testing.
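One thing that may be worth checking in connection with the mode switching: F.dropout defaults to training=True, so a call like F.dropout(x, p=self.dropout) keeps dropping activations even after model.eval() unless training=self.training is passed. The sketch below uses two hypothetical toy modules (AlwaysDrops and RespectsMode are made-up names, not the classes from this post) to show the difference. Using nn.Dropout as a submodule is another option, since it follows the module's mode automatically.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlwaysDrops(nn.Module):
    def forward(self, x):
        # training defaults to True, so eval() has no effect on this call
        return F.dropout(x, p=0.5)

class RespectsMode(nn.Module):
    def forward(self, x):
        # Dropout is only active while the module is in training mode
        return F.dropout(x, p=0.5, training=self.training)

torch.manual_seed(0)
x = torch.ones(1000)

always = AlwaysDrops().eval()
respects = RespectsMode().eval()

zeros_always = int((always(x) == 0).sum())
zeros_respects = int((respects(x) == 0).sum())
print(f"zeroed in eval mode: always={zeros_always}, respects={zeros_respects}")
```

If dropout stays active during decoding, every generated token is computed from randomly masked activations, which could contribute to poor samples even when the loss curves look fine.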
What are your thoughts on this?
Thank you very much!
Dimensions of the Encoder:
=301 (This is the size of the initial node embeddings)
=301 (This will also be the size of the final graph embedding, after mean pooling the node embeddings)
Dimensions of the Decoder:
=301 (the size of the previously pooled graph embedding)
=number of words in my vocabulary. In the training above around 1.2K
I am using top-k sampling and my train loop follows the NMT tutorial (https://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#training-the-model). Similarly, my translation function, which takes the data of a single graph, decodes a question as follows:
def translate(self, data):
    # Get node embeddings of the input graph
    h = self.encoder(data.node_embeddings,
                     data.edge_index, data.edge_embeddings)
    # Pool node embeddings into single graph embedding
    graph_embedding = self.get_graph_embeddings(h, data.graph_dict)
    # Pass graph embedding through decoder
    with torch.no_grad():
        # Initialize first input and hidden state
        decoder_input = torch.tensor(
            [[self.vocab.SOS['idx']]], device=self.device)
        decoder_hidden = graph_embedding.view(1, 1, -1)
        decoder_tokens = []
        for di in range(self.dec_max_length):
            decoder_output, decoder_hidden = self.decoder(
                decoder_input, decoder_hidden)
            # topk(1) keeps only the single most likely token
            topv, topi = decoder_output.data.topk(1)
            if topi.item() == self.vocab.EOS['idx']:
                # The body of this branch and the append below were missing
                # from the post; stopping at EOS is an assumption.
                break
            word = self.vocab.index2word[topi.item()]
            word = word.upper(
            ) if word == self.vocab.UNK['token'].lower() else word
            decoder_tokens.append(word)
            decoder_input = topi.squeeze().detach()
    return decoder_tokens
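Note that topk(1) in the loop above always returns the single most likely token, which is greedy decoding rather than sampling. A minimal sketch of what top-k sampling with k > 1 could look like, assuming the decoder output is a (1, vocab_size) tensor of log-probabilities as produced by LogSoftmax. The function name sample_top_k, the vocabulary size of 50 and k=5 are hypothetical choices for illustration:

```python
import torch

def sample_top_k(log_probs, k=5):
    """Sample one token index from the k most likely entries of log_probs."""
    top_lp, top_idx = log_probs.topk(k, dim=-1)        # both of shape (1, k)
    probs = torch.softmax(top_lp, dim=-1)              # renormalize over the k candidates
    choice = torch.multinomial(probs, num_samples=1)   # (1, 1) position within the top k
    return top_idx.gather(-1, choice)                  # (1, 1) vocabulary index

torch.manual_seed(0)
# Stand-in for decoder_output: log-probabilities over a 50-word vocabulary
log_probs = torch.log_softmax(torch.randn(1, 50), dim=1)

next_token = sample_top_k(log_probs, k=5)    # random among the 5 best tokens
greedy_token = sample_top_k(log_probs, k=1)  # k=1 reduces to argmax
```

In translate(), the returned tensor could take the place of topi. With k=1 the function behaves exactly like the current code.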
Also: at times, the output vector of the final GRU layer (self.gru3(...)) inside the forward() function (5th line from the bottom) contains a lot of values that are (close to) 1 and -1. I suppose these might otherwise be a lot higher/lower without clipping. This might be all right, but it seems unusual to me. An example:
tensor([[[-0.9984, -0.9950, 1.0000, -0.9889, -1.0000, -0.9770, -0.0299,
-0.9996, 0.9996, 1.0000, -0.0176, -0.5815, -0.9998, -0.0265,
-0.1471, 0.9998, -1.0000, -0.2356, 0.9964, 0.9936, -0.9998,
0.0652, -0.9999, 0.9999, -1.0000, -0.9998, -0.9999, 0.9998,
-1.0000, -0.9997, 0.9850, 0.9994, -0.9998, -1.0000, -1.0000,
0.9977, 0.9015, -0.9982, 1.0000, 0.9980, -1.0000, 0.9859,
0.6670, 0.9998, 0.3827, 0.9999, 0.9953, -0.9989, 0.1287,
1.0000, 1.0000, -1.0000, 0.9778, 1.0000, 1.0000, -0.9907, ...
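Values this close to 1 and -1 are what a saturated GRU looks like, rather than an effect of gradient clipping. Each new hidden state is a gate-weighted mix of the previous state and a tanh candidate, so with an initial hidden state inside [-1, 1] the outputs cannot leave that range. A small sketch with made-up sizes and deliberately large inputs (assumptions for illustration, not the model from this post):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
gru = nn.GRU(input_size=8, hidden_size=8)

# Shape (seq_len=5, batch=1, features=8); scaled up to push the gates into saturation
inputs = torch.randn(5, 1, 8) * 50.0
h0 = torch.zeros(1, 1, 8)

with torch.no_grad():
    output, hidden = gru(inputs, h0)

max_abs = output.abs().max().item()
near_one = (output.abs() > 0.99).float().mean().item()
print(f"largest |output| = {max_abs:.4f}, share above 0.99 = {near_one:.2f}")
```

In the post, the initial hidden state is the pooled graph embedding, which comes out of a linear layer and is not bounded, so it may be worth printing its range as well. Large pre-activations inside the GRU would be consistent with the large logits observed in out.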
Your code looks good, and given the training/validation curves you posted, it looks like it's doing alright.
How are you generating text samples? Are you just taking the word the model predicts with the highest probability, appending it to the end of your input sequence, and calling forward again? This technique, called greedy sampling, can lead to the behavior you described. Maybe another sampling technique could help (see beam search: https://medium.com/geekculture/beam-search-decoding-for-text-generation-in-python-9184699f0120)?
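To make the difference concrete, here is a minimal beam search sketch. It is a simplified illustration, not the implementation from the linked article: it applies no length normalization, and step_fn, toy_step and the probability table are hypothetical stand-ins for a real decoder call. The toy table is built so that greedy decoding falls into a repetition loop while a beam of width 2 finds the more probable finished sequence.

```python
import torch

def beam_search(step_fn, sos_idx, eos_idx, hidden, beam_width=3, max_len=20):
    """Return the highest-scoring token sequence found by beam search.

    step_fn(token_idx, hidden) must return (log_probs, new_hidden), where
    log_probs is a 1-D tensor over the vocabulary.
    """
    # Each hypothesis: (cumulative log-prob, tokens so far, hidden state, finished?)
    beams = [(0.0, [sos_idx], hidden, False)]
    for _ in range(max_len):
        candidates = []
        for score, tokens, h, done in beams:
            if done:
                candidates.append((score, tokens, h, True))
                continue
            log_probs, new_h = step_fn(tokens[-1], h)
            top_lp, top_idx = log_probs.topk(beam_width)
            for lp, idx in zip(top_lp.tolist(), top_idx.tolist()):
                candidates.append((score + lp, tokens + [idx], new_h, idx == eos_idx))
        # Keep only the beam_width best hypotheses
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
        if all(done for _, _, _, done in beams):
            break
    return beams[0][1]

# Toy vocabulary: 0 = SOS, 1 = "A", 2 = "B", 3 = EOS.
# Rows give next-token probabilities conditioned on the previous token only.
table = {
    0: [0.0, 0.6, 0.4, 0.0],
    1: [0.0, 0.4, 0.35, 0.25],
    2: [0.0, 0.01, 0.01, 0.98],
}

def toy_step(token_idx, hidden):
    # The hidden state is unused here and simply passed through
    return torch.log(torch.tensor(table[token_idx])), hidden

greedy_result = beam_search(toy_step, 0, 3, None, beam_width=1, max_len=5)
beam_result = beam_search(toy_step, 0, 3, None, beam_width=2, max_len=5)
print("greedy:", greedy_result)
print("beam:  ", beam_result)
```

With beam_width=1 the function reduces to greedy decoding. For the decoder in the question, step_fn could wrap a call to the decoder and flatten its log-softmax output to one dimension.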