Tags: dataset, artificial-intelligence, gensim, sentiment-analysis, doc2vec

Data set for Doc2Vec general sentiment analysis


I am trying to build a doc2vec model, using gensim and sklearn, to perform sentiment analysis on short sentences such as comments, tweets, and reviews.

I downloaded an Amazon product review data set, a Twitter sentiment analysis data set, and an IMDB movie review data set.

I then combined these into three categories: positive, negative, and neutral.

Next, I trained a gensim doc2vec model on the combined data so that I could obtain the input vectors for the classifying neural net.

I then used sklearn's LinearRegression model to predict on my test data, which is about 10% of each of the three data sets above.

Unfortunately, the results were not as good as I expected. Most of the tutorials out there seem to focus on one specific task only, such as 'classify Amazon reviews only' or 'Twitter sentiments only'. I couldn't find anything more general purpose.

Can someone share their thoughts on this?


Solution

  • How good did you expect the results to be, and how good were the results you actually achieved?

    Combining the three datasets may not improve overall sentiment-detection ability if the signifiers of sentiment vary across those domains. (Perhaps 'positive' tweets are worded very differently from product reviews or movie reviews. Tweets of just a few to a few dozen words are often quite different from reviews of hundreds of words.) Have you tried each separately to confirm that the combination is helping?
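
One way to run that check is a small harness that holds out a test slice per domain and scores each slice twice: once with a model trained on that domain alone, and once with a model trained on all domains together. The sketch below is only an illustration. `compare_separate_vs_combined` and `majority_baseline` are hypothetical names, the toy data is made up, and the majority-class scorer is a placeholder for whatever Doc2Vec-plus-classifier pipeline you actually use. It also assumes each domain's examples have already been shuffled.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (text, label)
Scorer = Callable[[List[Example], List[Example]], float]


def majority_baseline(train: List[Example], test: List[Example]) -> float:
    """Placeholder scorer: always predict the most common training label."""
    most_common_label = Counter(label for _, label in train).most_common(1)[0][0]
    hits = sum(1 for _, label in test if label == most_common_label)
    return hits / len(test)


def compare_separate_vs_combined(
    domains: Dict[str, List[Example]],
    train_and_score: Scorer,
    test_fraction: float = 0.1,
) -> Dict[str, Dict[str, float]]:
    """Score each domain's held-out slice with per-domain and combined training."""
    # Hold out the last `test_fraction` of each domain (assumes pre-shuffled data).
    splits = {}
    for name, examples in domains.items():
        n_test = max(1, int(len(examples) * test_fraction))
        splits[name] = (examples[:-n_test], examples[-n_test:])

    # Training pool built from every domain's training portion.
    combined_train = [ex for train, _ in splits.values() for ex in train]

    report = {}
    for name, (train, test) in splits.items():
        report[name] = {
            "separate": train_and_score(train, test),
            "combined": train_and_score(combined_train, test),
        }
    return report


# Toy data (made up) in which the two domains have opposite majority labels.
toy_domains = {
    "tweets": [("tweet text", "positive")] * 20,
    "reviews": [("review text", "negative")] * 60,
}
report = compare_separate_vs_combined(toy_domains, majority_baseline)
for name, scores in report.items():
    print(name, scores)
```

If the "combined" score for a domain is no better than its "separate" score on real data, merging the datasets is not helping for that domain.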

    Is your performance in line with other online reports of using roughly the same pipeline (Doc2Vec + LinearRegression) on roughly the same dataset(s), or wildly different? That will be a clue as to whether you're doing something wrong, or just have too-high expectations.

    For example, the doc2vec-IMDB.ipynb notebook bundled with gensim tries to replicate an experiment from the original 'Paragraph Vector' paper, doing sentiment-detection on an IMDB dataset. (I'm not sure if that's the same dataset as you're using.) Are your results in the same general range as that notebook achieves?

    Without seeing your code, and details of your corpus-handling & parameter choices, there could be all sorts of things wrong. Many online examples have nonsense choices. But maybe your expectations are just off.