
How to measure a word's weight using a doc2vec vector


I'm using the word2vec algorithm to detect the most important words in a document. My question is about how to compute the weight of an important word using the vector obtained from doc2vec. My code looks like this:

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load(fname)         # fname: path to a previously trained Doc2Vec model
word = ["suddenly"]                 # infer_vector expects a list of tokens
vectors = model.infer_vector(word)  # inferred vector for this one-word "document"

Thank you for your consideration.


Solution

  • Let's say you can find the vector R corresponding to an entire document, using doc2vec. Let's also assume that using word2vec, you can find the vector v corresponding to any word w as well. And finally, let's assume that R and v are in the same N-dimensional space.

    Assuming all this, you may then utilize plain old vector arithmetic to find out some correlations between R and v.

    For starters, you can normalize v. Normalizing, after all, is just dividing each dimension by the magnitude of v (i.e. |v|). Let's call the normalized version of v v_normal.

    Then, you may project v_normal onto the line represented by the vector R. That projection essentially amounts to taking the dot product of v_normal and R (strictly speaking, the scalar projection is that dot product divided by |R|, but since |R| is the same constant for every word, it does not change how words compare). Let's call the scalar result of the dot product len_projection. You can consider len_projection / |v_normal| as an indication of how parallel the context of the word is to that of the overall document. In fact, considering just len_projection is enough, because v_normal is normalized, so |v_normal| == 1.

    Now, you may apply this procedure to all words in the document, and consider the words with the greatest len_projection values as the most significant words of that document (see the code sketch at the end of this answer).

    Note that this method may end up finding frequently used words like "I" or "and" as the most important words in a document, since such words appear in many different contexts. If that is an issue you want to remedy, you may want to do a post-processing step that filters out such common words (a short snippet after the sketch below shows one way).

    I sort of thought of this method on the spot here, and I am not sure whether this approach has any scientific backing. But it may make sense if you think about how most vector embeddings for words work. Word vectors are usually trained to represent the context in which a word is used. Thinking in terms of vector arithmetic, projecting a word's vector onto a line may reveal how parallel the context of that word w is to the overall context represented by that line.

    Last but not least, as I have worked only with word2vec before, I am not sure whether doc2vec and word2vec data can be used jointly like I described above. As I stated in the first paragraph of my answer, it is really critical that R and v be in the same N-dimensional space.
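
    To make the procedure concrete, here is a minimal Python sketch of the idea above. It assumes a gensim Doc2Vec model whose word vectors (model.wv) live in the same N-dimensional space as the inferred document vector R (which is typically the case for distributed-memory / PV-DM training); the model path and the example document are placeholders, not part of the original question:

        import numpy as np
        from gensim.models.doc2vec import Doc2Vec

        # Placeholder model path and document -- substitute your own.
        model = Doc2Vec.load("my_doc2vec.model")
        doc_words = "suddenly the door opened and she walked in".split()

        # R: the vector for the whole document, inferred by doc2vec.
        R = model.infer_vector(doc_words)

        def projection_score(word):
            """Dot product of the normalized word vector with R (proportional to its projection onto R)."""
            v = model.wv[word]                    # word vector v
            v_normal = v / np.linalg.norm(v)      # divide by |v| to get v_normal
            return float(np.dot(v_normal, R))     # len_projection (up to the constant factor |R|)

        # Score every distinct word of the document that the model has a vector for.
        scores = {w: projection_score(w) for w in set(doc_words) if w in model.wv}

        # The words with the greatest projection are treated as the most significant.
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for word, score in ranked:
            print(f"{word:>10}  {score:.3f}")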
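
    And a quick way to do the post-processing step for common words is to drop them from the ranked list produced by the sketch above; the stop-word set here is just a toy placeholder, not a real list:

        STOPWORDS = {"the", "and", "a", "in", "i", "she"}  # placeholder stop-word set
        ranked = [(w, s) for w, s in ranked if w not in STOPWORDS]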