Tags: nlp, data-science, word2vec

How does word2vec work to find sentence similarity?


I am using word2vec/doc2vec to find the text similarity of two documents. I have studied that word2vec works with two approaches:

  • CBOW: which predicts a word on the basis of its context
  • Skip-gram: which predicts the context on the basis of a word

But I am stuck at understanding how these two approaches work when calculating text similarities. Also, which one is the better approach for this task?


Solution

  • Word vectors just model individual words.

    But you can then use these per-word vectors to create vectors for larger texts, or to calculate similarities between larger texts.

    A simple way to turn a text into a single fixed-width vector is to average the word-vectors of all the text's words. (This could also be a weighted average, based on some measure of individual words' importance.) This sort of text-vector can often work well as a quick and simple baseline. The cosine similarity of two such averages-of-all-their-word-vectors is then taken as the similarity of the two texts (a minimal sketch of this approach appears after this answer).

    An algorithm like Doc2Vec (aka "Paragraph Vector") is an alternative way to get a vector for a text. It doesn't strictly combine word-vectors, but rather uses a training process much like the one used to create word-vectors to learn per-text vectors instead (see the second sketch below).

    If working with just word-vectors, another option for text-to-text similarity is "Word Mover's Distance" (WMD). Rather than averaging all word-vectors together to create a single vector for the text, the WMD measure treats all the words of a text as "piles of meaning" located at their word-vectors' coordinates. The distance between two texts is then how much effort is required to "move" the mass of one text's word-vectors to match the other's. It's expensive (each pairwise calculation is an optimization problem over many possible word-to-word shifts), but it retains a bit more distinction than collapsing a text into a single summary vector (the third sketch below uses gensim's wmdistance() for this).
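For the averaging approach described above, here is a minimal sketch, assuming gensim 4.x; the toy corpus, vector sizes, and helper functions are placeholders chosen only for illustration:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy tokenized corpus, just enough to train a throwaway model for illustration.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["a", "dog", "slept", "on", "the", "rug"],
    ["stock", "prices", "fell", "sharply", "today"],
]

# sg=0 trains CBOW, sg=1 trains skip-gram; either produces usable word-vectors.
model = Word2Vec(corpus, vector_size=50, min_count=1, sg=0, epochs=50)

def text_vector(tokens, kv):
    """Unweighted average of the vectors of all in-vocabulary tokens."""
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = text_vector(["the", "cat", "sat", "on", "the", "mat"], model.wv)
v2 = text_vector(["a", "dog", "slept", "on", "the", "rug"], model.wv)
print("average-vector cosine similarity:", cosine(v1, v2))
```

In real use you would train on a much larger corpus, or load pretrained word-vectors, rather than a toy list like this.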
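For the Doc2Vec alternative, a similar sketch (again assuming gensim 4.x; the documents, tags, and parameters are illustrative only):

```python
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each training text gets a tag; Doc2Vec learns a vector per tag alongside the word-vectors.
docs = [
    TaggedDocument(words=["the", "cat", "sat", "on", "the", "mat"], tags=["doc0"]),
    TaggedDocument(words=["a", "dog", "slept", "on", "the", "rug"], tags=["doc1"]),
    TaggedDocument(words=["stock", "prices", "fell", "sharply", "today"], tags=["doc2"]),
]

model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=100)

# infer_vector() produces a fixed-width vector for any new tokenized text;
# cosine similarity between two inferred vectors then serves as the text similarity.
v1 = model.infer_vector(["the", "cat", "sat", "on", "the", "mat"])
v2 = model.infer_vector(["a", "dog", "slept", "on", "the", "rug"])
similarity = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
print("Doc2Vec inferred-vector similarity:", similarity)
```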
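And for Word Mover's Distance, a sketch using gensim's wmdistance() (this needs gensim's optional POT/pyemd dependency installed; with a toy model like this the absolute number is not meaningful, only the mechanics):

```python
from gensim.models import Word2Vec

texts = [
    ["obama", "speaks", "to", "the", "media", "in", "illinois"],
    ["the", "president", "greets", "the", "press", "in", "chicago"],
]

# A throwaway model for illustration; real use would need a large corpus or pretrained vectors.
model = Word2Vec(texts, vector_size=50, min_count=1, epochs=50)

# WMD is a distance, not a similarity: lower values mean more-similar texts.
distance = model.wv.wmdistance(texts[0], texts[1])
print("Word Mover's Distance:", distance)
```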