While learning Doc2Vec library, I got stuck on the following question.
Do gensim Doc2Vec distinguish between the same Sentence with positive and negative context?
For Example:
Sentence A: "I love Machine Learning"
Sentence B: "I do not love Machine Learning"
If I train sentence A and B with doc2vec and find cosine similarity between their vectors:
Also, If I train only on sentence A and try to infer Sentence B, will both vectors be close to each other in vector space.?
I would request the NLP community and Doc2Vec experts for helping me out in understanding this.
Thanks in Advance !!
Inherently, all that the 'Paragraph Vector' algorithm behind gensim Doc2Vec
does is find a vector that (together with a neural-network) is good at predicting the words that appear in a text. So yes, texts with almost-identical words will have very close vectors. (There's no syntactic understanding that certain words, in certain places, have a big reversing-effect.)
However, even such vectors may be ok (though not state-of-the-art) at sentiment analysis. One of the ways the original 'Paragraph Vectors' paper evaluated the vector usability was estimating the sentiment of short movie reviews. (These were longer than a single sentence – into the hundreds of words.) When training a classifier on the doc-vectors, the classifier did a pretty good job, and better than other baseline techniques, at estimating the negativity/positivity of reviews.
Your single, tiny, contrived sentences could be harder – they're short with just a couple words' difference, so the vectors will be very close. But those different words (especially 'not'
) are often very indicative of sentiment – so the tiny difference might be enough to shift the vector from the 'positive' regions to the 'negative' regions.
So you'd have to try it, with a real training corpus of tens of thousands of varied text examples (because this technique doesn't work well on toy-sized datasets) and a post-vectorization classifier step.
Note also that in pure Doc2Vec
, adding known labels (like 'positive' or 'negative') during training (alongside or instead of any unique document-ID based tags) can sometimes help the resulting vector-space be more sensitive to the distinction you want. And, other variant techniques like 'FastText' or 'StarSpace' more directly integrate known-labels into the vectorization in a way that might help.
The best results on short sentences, though, would probably take into account the relative ordering of words and grammatical parsing. You can see a demo of such a more-advanced technique at a page from Stanford's NLP research group:
http://nlp.stanford.edu:8080/sentiment/rntnDemo.html
Though look in the comments there for various examples of hard cases that it still struggles with.