Tags: python, nlp, semantics, word2vec

Using Custom Word2Vec to find semantic similarity between technical questions?


We can get the similarity of two sentences like "The boy is playing football" and "A kid is playing football" using the Google News vectors by applying "SIF Embeddings".

I would like to get the similarity for two technical sentences such as "what is an abstract class?" and "what is a class?".

I have used the Google News vectors to get the similarity, but it didn't work well.

I would like to know how the training data should be prepared.


Solution

  • Word2Vec is an algorithm that generates vectors for words; the vectors of similar words tend to be similar. It does not handle sentences on its own.

    You have more or less the following options:

    • Create a sentence vector
    • Compare similarity of word vectors within two sentences

    Create a sentence vector

    You could build sentence, paragraph, or document vectors. There are different approaches to this: you could, for example, combine the word2vec vectors of the individual words. If you just want a ready-made solution, you could go for gensim's doc2vec (see the sketch below): https://radimrehurek.com/gensim/models/doc2vec.html
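
    As a rough illustration, here is a minimal doc2vec sketch with gensim. The toy corpus, the hyperparameters, and the gensim 4.x attribute name model.dv are assumptions for illustration, not a tuned setup; a real model needs a much larger corpus of domain questions.

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus of technical questions (a real corpus should be much larger).
corpus = [
    "what is an abstract class",
    "what is a class",
    "how do i override a method",
]
documents = [TaggedDocument(words=text.split(), tags=[i])
             for i, text in enumerate(corpus)]

# Small vectors and many epochs only because the toy corpus is tiny.
model = Doc2Vec(documents, vector_size=50, min_count=1, epochs=40)

# Infer a vector for an unseen question and look up the closest training question.
vec = model.infer_vector("what is an interface".split())
print(model.dv.most_similar([vec], topn=1))
```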

    Other than that, there are methods like averaging or concatenating the individual (fixed-length) word vectors into a single sentence vector, as sketched below.
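
    A minimal averaging baseline might look like the following sketch; it assumes a pre-trained gensim KeyedVectors model (for example the Google News vectors) has already been loaded into wv.

```python
import numpy as np
# Assumed to be loaded beforehand, e.g.:
# wv = gensim.models.KeyedVectors.load_word2vec_format(
#     "GoogleNews-vectors-negative300.bin", binary=True)

def average_vector(sentence, wv):
    """Average the word2vec vectors of the in-vocabulary words of a sentence."""
    words = [w for w in sentence.lower().split() if w in wv]
    if not words:
        return np.zeros(wv.vector_size)
    return np.mean([wv[w] for w in words], axis=0)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

v1 = average_vector("what is an abstract class", wv)
v2 = average_vector("what is a class", wv)
print(cosine(v1, v2))  # higher cosine = more similar sentence vectors
```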

    Similar questions: How to calculate the sentence similarity using word2vec model of gensim with python

    Compare similarity of word vectors within two sentences

    One such approach is Word Mover's Distance: Pairwise Earth Mover Distance across all documents (word2vec representations)

    This seems like a good but computationally expensive approach.
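
    For reference, gensim exposes Word Mover's Distance directly on its word-vector models. The sketch below assumes wv is a loaded KeyedVectors model and that the optional POT/pyemd dependency required by wmdistance is installed.

```python
# Lower WMD means the two questions are more similar.
q1 = "what is an abstract class".split()
q2 = "what is a class".split()
print(wv.wmdistance(q1, q2))
```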

    Update: You have since updated your question to mention that you are using "SIF Embeddings" (rather than plain word2vec): https://openreview.net/forum?id=SyK00v5xx
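
    For completeness, SIF itself is only a frequency-weighted average of word vectors followed by removal of the first principal component, so it is easy to sketch. Everything below (the wv word-vector model, the word_prob unigram probabilities, and the weighting constant a) is an assumption for illustration, not the reference implementation from the paper.

```python
import numpy as np

def sif_embeddings(sentences, wv, word_prob, a=1e-3):
    """SIF: weighted average of word vectors, then remove the first principal component."""
    vecs = []
    for sent in sentences:
        words = [w for w in sent if w in wv]
        if not words:
            vecs.append(np.zeros(wv.vector_size))
            continue
        weights = [a / (a + word_prob.get(w, 1e-6)) for w in words]
        vecs.append(np.average([wv[w] for w in words], axis=0, weights=weights))
    X = np.vstack(vecs)
    u = np.linalg.svd(X, full_matrices=False)[2][0]  # first right singular vector
    return X - X @ np.outer(u, u)                    # remove the common component

# Compare the two technical questions with cosine similarity.
q1 = "what is an abstract class".split()
q2 = "what is a class".split()
emb = sif_embeddings([q1, q2], wv, word_prob)
print(emb[0] @ emb[1] / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1])))
```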