Tags: python, nlp, semantics, word2vec

Using Custom Word2Vec to find semantic similarity between technical questions?


We can get the similarity of two sentences like "The boy is playing football" and "A kid is playing football" using the Google News vectors by applying "SIF Embeddings".

I would like to get the similarity for two technical sentences such as "what is an abstract class?" and "what is a class?".

I have used the Google News vectors to get the similarity, but it didn't work well.

I would like to know how the training data should be prepared.


Solution

  • Word2Vec is an algorithm that generates vectors for words; the vectors of similar words tend to be similar. It does not handle sentences on its own.

    You have more or less the following options:

    • Create a sentence vector
    • Compare similarity of word vectors within two sentences

    Create a sentence vector

    You could build sentence, paragraph, or document vectors. There are different approaches to this: you could, for example, combine the word2vec vectors of the individual words. If you just want a ready-made solution, you could go for gensim's doc2vec (see the sketch below): https://radimrehurek.com/gensim/models/doc2vec.html
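
    As a rough illustration, here is a minimal doc2vec sketch with gensim. The toy corpus, the hyperparameters, and the gensim 4.x attribute name model.dv are assumptions for illustration, not a tuned setup; a real model needs a much larger corpus of domain questions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus of technical questions (a real corpus should be much larger).
corpus = [
    "what is an abstract class",
    "what is a class",
    "how do i override a method",
]
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(corpus)]

# Small vectors and many epochs only because the toy corpus is tiny.
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen question and look up the closest training question.
vec = model.infer_vector("what is an interface".split())
print(model.dv.most_similar([vec], topn=1))
```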

    Other than that, there are methods like averaging or concatenating the individual (fixed-length) word vectors into a single sentence vector, as sketched below.
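
    A minimal averaging baseline might look like the following sketch; it assumes a pre-trained gensim KeyedVectors model (for example the Google News vectors) has already been loaded into wv.

```python
import numpy as np
# Assumed to be loaded beforehand, e.g.:
# wv = gensim.models.KeyedVectors.load_word2vec_format(
#     "GoogleNews-vectors-negative300.bin", binary=True)

def average_vector(sentence, wv):
    """Average the word2vec vectors of the in-vocabulary words of a sentence."""
    words = [w for w in sentence.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v1 = average_vector("what is an abstract class", wv)
v2 = average_vector("what is a class", wv)
print(cosine(v1, v2))  # higher cosine = more similar sentence vectors
```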

    Similar questions: How to calculate the sentence similarity using word2vec model of gensim with python

    Compare similarity of word vectors within two sentences

    One such approach is Word Mover's Distance: Pairwise Earth Mover Distance across all documents (word2vec representations)

    This seems like a good but computationally expensive approach.
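
    For reference, gensim exposes Word Mover's Distance directly on its word-vector models. The sketch below assumes wv is a loaded KeyedVectors model and that the optional POT/pyemd dependency required by wmdistance is installed.

```python
# Lower WMD means the two questions are more similar.
q1 = "what is an abstract class".split()
q2 = "what is a class".split()
print(wv.wmdistance(q1, q2))
```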

    Update: You have since updated your question to mention that you are using "SIF Embeddings" (rather than plain word2vec): https://openreview.net/forum?id=SyK00v5xx
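
    For completeness, SIF itself is only a frequency-weighted average of word vectors followed by removal of the first principal component, so it is easy to sketch. Everything below (the wv word-vector model, the word_prob unigram probabilities, and the weighting constant a) is an assumption for illustration, not the reference implementation from the paper.

```python
import numpy as np

def sif_embeddings(sentences, wv, word_prob, a=1e-3):
    """SIF: weighted average of word vectors, then remove the first principal component."""
    vecs = []
    for sent in sentences:
        words = [w for w in sent if w in wv]
        if not words:
            vecs.append(np.zeros(wv.vector_size))
            continue
        weights = [a / (a + word_prob.get(w, 1e-6)) for w in words]
        vecs.append(np.average([wv[w] for w in words], axis=0, weights=weights))
    X = np.vstack(vecs)
    u = np.linalg.svd(X, full_matrices=False)[2][0]  # first right singular vector
    return X - X @ np.outer(u, u)                    # remove the common component

# Compare the two technical questions with cosine similarity.
q1 = "what is an abstract class".split()
q2 = "what is a class".split()
emb = sif_embeddings([q1, q2], wv, word_prob)
print(emb[0] @ emb[1] / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
```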