I have an LSTM I'm using as a sequence generator, trained on word2vec vectors. The previous implementation produced a probability distribution over all the different labels; there was one label for every word in the vocabulary. This implementation used PyTorch's CrossEntropyLoss. I now want to change this so the LSTM outputs a vector with the same dimensions as the vectors used for training. That way I could use the Euclidean distance measure to match it to nearby vectors in the vocabulary. The problem is that in order to do this I have to use a different loss function, because CrossEntropyLoss is appropriate for classifiers, not for regression problems.
I tried changing the format of the target vector, but torch's CrossEntropyLoss requires integer class indices as targets, and I have a word vector. Having looked at a few options, it seems CosineEmbeddingLoss might be a good idea, but I don't understand how it works or what kind of input it takes.
I have already changed my fully connected layer to output vectors with the same dimensions as the word embeddings used for training:
nn.Linear(in_features=self.cfg.lstm.lstm_num_hidden, out_features=self.cfg.lstm.embedding_dim, bias=True)
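At prediction time, the matching step I have in mind is roughly the following nearest-neighbour lookup (vocab_vectors and output are just placeholder names and random data here):

import torch

vocab_vectors = torch.randn(10000, 200)   # one word2vec vector per vocabulary word (placeholder)
output = torch.randn(200)                 # one vector predicted by the LSTM (placeholder)

# Euclidean distance from the predicted vector to every vocabulary vector
distances = torch.norm(vocab_vectors - output, dim=1)
nearest_word_idx = torch.argmin(distances)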
Any advice and examples would be much appreciated.
As the documentation of CosineEmbeddingLoss says:
Creates a criterion that measures the loss given two input tensors and a Tensor label with values 1 or -1.
In your scenario, you should always provide 1 as the Tensor label.
import torch

batch_size, seq_len, w2v_dim = 32, 100, 200
x1 = torch.randn(batch_size, seq_len, w2v_dim)  # target word2vec embeddings
x2 = torch.randn(batch_size, seq_len, w2v_dim)  # LSTM output projected to w2v_dim
y = torch.ones(batch_size, seq_len)             # label 1 for every position

loss_fn = torch.nn.CosineEmbeddingLoss(reduction='none')
# CosineEmbeddingLoss expects 2D inputs, so flatten the batch and time dimensions.
loss = loss_fn(x1.view(-1, w2v_dim),
               x2.view(-1, w2v_dim),
               y.view(-1))
loss = loss.view(batch_size, seq_len)
Here, I assume x1 is the word embeddings and x2 is the output of the LSTM followed by some transformation.
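Putting it together with the linear layer from your question, one training step could look roughly like this; the sizes and variable names (lstm_out, target_vectors) are just assumptions about your setup:

import torch
import torch.nn as nn

batch_size, seq_len = 32, 100
lstm_num_hidden, embedding_dim = 256, 200  # assumed sizes

lstm_out = torch.randn(batch_size, seq_len, lstm_num_hidden)      # output of your LSTM
target_vectors = torch.randn(batch_size, seq_len, embedding_dim)  # word2vec vectors of the target words

fc = nn.Linear(in_features=lstm_num_hidden, out_features=embedding_dim, bias=True)
predicted_vectors = fc(lstm_out)  # (batch_size, seq_len, embedding_dim)

loss_fn = nn.CosineEmbeddingLoss()
labels = torch.ones(batch_size * seq_len)
loss = loss_fn(predicted_vectors.reshape(-1, embedding_dim),
               target_vectors.reshape(-1, embedding_dim),
               labels)
loss.backward()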
Why should I always provide 1 as the Tensor label?
First, look at how the loss function is defined: with a label of 1, CosineEmbeddingLoss returns 1 - cos(x1, x2), and with a label of -1 it returns max(0, cos(x1, x2) - margin).
In your scenario, the higher the cosine similarity is, the lower the loss should be. In other words, you want to maximize the cosine similarity. So, you need to provide 1 as the label.
On the other hand, if you want to minimize the cosine similarity, you need to provide -1 as the label.
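You can verify this behaviour with a small self-contained check on random data:

import torch
import torch.nn.functional as F

a = torch.randn(4, 200)
b = torch.randn(4, 200)
loss_fn = torch.nn.CosineEmbeddingLoss(reduction='none')

# With label 1 the loss is exactly 1 - cos(a, b), so minimizing it drives the
# cosine similarity towards 1.
print(torch.allclose(loss_fn(a, b, torch.ones(4)), 1 - F.cosine_similarity(a, b)))            # True

# With label -1 (and the default margin=0) the loss is max(0, cos(a, b)), so
# minimizing it drives the cosine similarity towards 0 or below.
print(torch.allclose(loss_fn(a, b, -torch.ones(4)), F.cosine_similarity(a, b).clamp(min=0)))  # True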