Tags: pytorch, tensor, embedding, bert-language-model, loss

Pytorch Loss Function for making embeddings similar


I am working on an embedding model in which a BERT model takes in text inputs and outputs a multidimensional vector. The goal of the model is to produce similar embeddings (high cosine similarity) for texts that are similar and different embeddings (low cosine similarity) for texts that are dissimilar.

When training in mini-batch mode, the BERT model gives an N*D output, where N is the batch size and D is the output dimension of the BERT model.

I also have a target matrix of dimension N*N, which contains 1 in position [i, j] if sentence[i] and sentence[j] are similar in meaning and -1 if not.
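
For illustration (values made up; I assume the diagonal is 1, since each sentence is similar to itself), with N=3 where sentences 0 and 1 are similar to each other and sentence 2 is dissimilar to both, the target matrix would look like:

# Hypothetical 3x3 target: 1 = similar, -1 = dissimilar
targets = torch.tensor([[ 1.,  1., -1.],
                        [ 1.,  1., -1.],
                        [-1., -1.,  1.]])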

What I want to do is compute the loss/error for the entire batch by taking the pairwise cosine similarities of all embeddings in the BERT output and comparing them to the target matrix.

What I did was simply multiply the tensor by its transpose and then take an elementwise sigmoid.

# Pairwise dot products: (N, D) x (D, N) -> (N, N)
scores = torch.matmul(document_embedding, torch.transpose(document_embedding, 0, 1))
scores = torch.sigmoid(scores)

# self.bceloss is a BCE criterion; targets is the N*N matrix of 1 / -1 values
loss = self.bceloss(scores, targets)

But this does not seem to work.

Is there any other way to do this?

P.S. What I want to do is similar to the method described in this paper.


Solution

  • To calculate the cosine similarity between two vectors you would normally use nn.CosineSimilarity (see the short sketch just below). However, I don't think this allows you to get the pairwise similarities for a set of n vectors. Fortunately, you can implement it yourself with some tensor manipulation.
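
    For reference, here is a minimal sketch of nn.CosineSimilarity on two individual vectors (the example vectors are made up for illustration):

    >>> import torch
    >>> from torch import nn
    >>> cos = nn.CosineSimilarity(dim=0)
    >>> cos(torch.tensor([1., 1.]), torch.tensor([1., 0.]))
    tensor(0.7071)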

    Let us call x your document_embedding of shape (n, d) where d is the embedding size. We'll take n=3 and d=5. So x is made up of [x1, x2, x3].T.

    >>> n, d = 3, 5
    >>> x = torch.rand(n, d)
    >>> x
    tensor([[0.8620, 0.9322, 0.4220, 0.0280, 0.3789],
            [0.2747, 0.4047, 0.6418, 0.7147, 0.3409],
            [0.6573, 0.3432, 0.5663, 0.2512, 0.0582]])
    

    The cosine similarity is a normalized dot product. The x @ x.T matrix multiplication gives you the matrix of pairwise dot products, which contains ||x1||², <x1, x2>, <x1, x3>, <x2, x1>, ||x2||², etc.

    >>> sim = x @ x.T
    >>> sim
    tensor([[1.9343, 1.0340, 1.1545],
            [1.0340, 1.2782, 0.8822],
            [1.1545, 0.8822, 0.9370]])
    

    To normalize, take the vector of all norms ||x1||, ||x2||, and ||x3||:

    >>> norm = x.norm(dim=1)
    >>> norm
    tensor([1.3908, 1.1306, 0.9680])
    

    Construct the matrix containing the normalization factors ||x1||², ||x1||·||x2||, ||x1||·||x3||, ||x2||·||x1||, ||x2||², etc.:

    >>> factor = norm * norm.unsqueeze(1)
    >>> factor
    tensor([[1.9343, 1.5724, 1.3462],
            [1.5724, 1.2782, 1.0944],
            [1.3462, 1.0944, 0.9370]])
    

    Then normalize:

    >>> sim /= factor
    >>> sim
    tensor([[1.0000, 0.6576, 0.8576],
            [0.6576, 1.0000, 0.8062],
            [0.8576, 0.8062, 1.0000]])
    

    Alternatively, a quicker way, which avoids building the norm matrix, is to normalize x before multiplying:

    >>> x /= x.norm(dim=1, keepdim=True)
    >>> sim = x @ x.T
    >>> sim
    tensor([[1.0000, 0.6576, 0.8576],
            [0.6576, 1.0000, 0.8062],
            [0.8576, 0.8062, 1.0000]])
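
    Equivalently, torch.nn.functional.normalize performs the same row normalization in one call (with a small eps that guards against zero-norm rows); starting again from the unnormalized x, the same similarity matrix can be computed as:

    >>> import torch.nn.functional as F
    >>> xn = F.normalize(x, dim=1)
    >>> sim = xn @ xn.T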
    

    For the loss function, I would apply nn.CrossEntropyLoss straight away between the predicted similarity matrix and the target, instead of applying sigmoid + BCE. Note: nn.CrossEntropyLoss already includes nn.LogSoftmax, so you should not apply a sigmoid or softmax to the scores beforehand. A rough sketch of that idea follows below.
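
    As a minimal sketch of that suggestion (the target here is an assumption: it supposes each sentence's single positive match sits on the diagonal, i.e. in-batch negatives, which is not the same thing as the 1/-1 target matrix from the question):

    import torch
    import torch.nn.functional as F

    n, d = 3, 5
    x = torch.rand(n, d)                    # stand-in for the BERT output of shape (N, D)

    # Pairwise cosine similarities: normalize the rows, then multiply.
    x = x / x.norm(dim=1, keepdim=True)
    sim = x @ x.T                           # (N, N) similarity matrix

    # Hypothetical target: sentence i's positive example is at column i.
    target = torch.arange(n)

    # cross_entropy applies log-softmax over each row internally,
    # so no sigmoid/softmax is applied beforehand.
    loss = F.cross_entropy(sim, target)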