Tags: nlp, nltk, gensim, word2vec, fasttext

Is there a semantic similarity method that outperforms word2vec approach for semantic accuracy?


I am looking at various semantic similarity methods such as word2vec, Word Mover's Distance (WMD), and fastText. fastText is not better than Word2Vec as far as semantic similarity is concerned, and WMD and Word2Vec give almost identical results.

I was wondering if there is an alternative that has outperformed the Word2Vec model in semantic accuracy?

My use case: computing word embeddings for two sentences, then using cosine similarity to measure their similarity.


Solution

Whether any technique "outperforms" another will depend highly on your training data, the specific metaparameter options you choose, and your exact end-task. (Even "semantic similarity" may have many alternate aspects depending on the application.)

There's no one way to go from word2vec word-vectors to a sentence/paragraph vector. You could add the raw vectors. You could average the unit-normalized vectors. You could perform some other sort of weighted average, based on other measures of word significance. So your implied baseline is unclear.
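To make those choices concrete, here is a minimal sketch of a few such wordvecs-to-textvecs schemes using gensim's KeyedVectors. The model file name, tokenization, and example sentences are placeholders for your own setup, not anything prescribed by the answer.

```python
# Sketch: three ways to compose word2vec word-vectors into a sentence vector.
# 'vectors.bin' is a hypothetical pretrained model file.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

def sentence_vector(tokens, scheme='mean'):
    """Compose one vector from whichever tokens are in the vocabulary."""
    vecs = [wv[t] for t in tokens if t in wv]
    if not vecs:
        return None
    if scheme == 'sum':           # add the raw vectors
        return np.sum(vecs, axis=0)
    if scheme == 'unit_mean':     # average the unit-normalized vectors
        return np.mean([v / np.linalg.norm(v) for v in vecs], axis=0)
    return np.mean(vecs, axis=0)  # plain average

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

s1 = sentence_vector("the cat sat on the mat".split())
s2 = sentence_vector("a kitten rested on the rug".split())
print(cosine(s1, s2))
```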

Essentially you have to try a variety of methods and parameters, for your data and goal, with your custom evaluation.
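One such custom evaluation is to rank-correlate each method's scores against human similarity judgments, if you have (or can collect) labeled pairs. The pairs below are made up for illustration; the scoring function is whatever method you are testing.

```python
# Sketch: score every labeled pair with a candidate method, then measure
# Spearman rank-correlation against the human judgments.
from scipy.stats import spearmanr

labeled_pairs = [  # (text_a, text_b, human_score) -- illustrative only
    ("a boy plays football", "a child kicks a ball", 4.5),
    ("a boy plays football", "a child reads a book", 2.0),
    ("a boy plays football", "stock prices fell sharply", 0.5),
]

def evaluate(score_fn):
    predicted = [score_fn(a, b) for a, b, _ in labeled_pairs]
    gold = [g for _, _, g in labeled_pairs]
    return spearmanr(predicted, gold).correlation

# e.g., using the composition sketch above:
# evaluate(lambda a, b: cosine(sentence_vector(a.split()),
#                              sentence_vector(b.split())))
```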

Word Mover's Distance doesn't reduce each text to a single vector, and the pairwise calculation between two texts can be expensive, but it has been reported to perform very well on some semantic-similarity tasks.
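For reference, gensim exposes WMD directly on its word-vectors (reusing `wv` from the sketch above). It needs an optional solver package (pyemd in older gensim versions, POT in newer ones), and it returns a distance rather than a similarity, so lower means more alike:

```python
# Sketch: Word Mover's Distance between two token lists -- a pairwise
# computation, with no single-vector representation per text.
doc1 = "obama speaks to the media in illinois".split()
doc2 = "the president greets the press in chicago".split()
print(wv.wmdistance(doc1, doc2))  # lower = more similar
```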

FastText is essentially word2vec with some extra enhancements and new modes. Some modes with the extras turned off are exactly the same as word2vec, so using FastText word-vectors in some wordvecs-to-textvecs scheme should closely approximate using word2vec word-vectors in the same scheme. Some modes might help the word-vector quality for some purposes, but make the word-vectors less effective inside a wordvecs-to-textvecs scheme. Some modes might make the word-vectors better for sum/average composition schemes – you should look especially at the 'classifier' mode, which trains word-vecs to be good, when averaged, at a classification task. (To the extent you may have any semantic labels for your data, this might make the word-vecs more composable for semantic-similarity tasks.)
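A small gensim sketch of those modes: with character n-grams disabled, FastText training reduces to word2vec-style training, while with n-grams enabled it can also synthesize vectors for out-of-vocabulary words. The toy corpus is illustrative only, and the 'classifier' (supervised) mode mentioned above is part of Facebook's fastText tool rather than gensim, so it is not shown here.

```python
# Sketch: gensim FastText with and without character n-grams.
from gensim.models import FastText

corpus = [["the", "cat", "sat"], ["the", "dog", "ran"]]  # toy corpus

# max_n < min_n disables char n-grams, approximating plain word2vec
plain = FastText(corpus, vector_size=50, min_count=1, max_n=0)

# subword mode: character n-grams of length 3..6 (the defaults)
subword = FastText(corpus, vector_size=50, min_count=1, min_n=3, max_n=6)

print("catlike" in subword.wv.key_to_index)  # False: not in the vocabulary
print(subword.wv["catlike"][:3])             # still composable from n-grams
```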

You may also want to look at the 'Paragraph Vectors' technique (available in gensim as Doc2Vec), or other research results that go by the shorthand names 'FastSent' or 'sent2vec'.
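As a final sketch, gensim's Doc2Vec trains paragraph vectors directly and can infer a vector for any new sentence; the corpus and parameters below are toy placeholders.

```python
# Sketch: Paragraph Vectors via gensim Doc2Vec -- train on tagged documents,
# infer a vector per sentence, compare with cosine similarity.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = ["the cat sat on the mat",
         "a kitten rested on the rug",
         "stock prices fell sharply"]
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

v1 = model.infer_vector("the cat sat on the mat".split())
v2 = model.infer_vector("a kitten rested on the rug".split())
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```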