python, sentiment-analysis, textblob, vader

How could I improve the accuracy of sentiment analysis of news headlines?


I'm using Vader and TextBlob to analyse the sentiment of news headlines with mixed results: many headlines I would consider slightly negative are scored as neutral. Here are a few examples:

Who wants to live in an artificially intelligent future?
# Vader: {'compound': 0.4588, 'pos': 0.273, 'neu': 0.727, 'neg': 0.0}
# TextBlob: Sentiment(polarity=0.2840909090909091, subjectivity=0.40625)

The internet and social media provide huge opportunities for the coming generation, but there’s a dark side from which it must be protected.
# Vader: {'compound': 0.743, 'pos': 0.278, 'neu': 0.722, 'neg': 0.0}
# TextBlob: Sentiment(polarity=0.09444444444444448, subjectivity=0.45555555555555555)

For three months I’ve lived without tech and now realise we need to question its ever-encroaching invasion – before we end up in bed with a sex robot.
# Vader: {'compound': 0.0, 'pos': 0.0, 'neu': 1.0, 'neg': 0.0}
# TextBlob: Sentiment(polarity=0.0, subjectivity=0.0)

I think the first sentence could be read either way, but the other two definitely have negative elements to them: "there’s a dark side" and "its ever-encroaching invasion". I'm therefore surprised that Vader gives both a negative score of 0 and that TextBlob gives both a polarity of 0 or above.

Are these kinds of text just fundamentally difficult for sentiment analysis algorithms, or is there another approach I could consider?

The attraction of the libraries I mentioned is that I don't have to build my own classification dataset, but I might consider doing so if it were likely to give better results.


Solution

  • The basic difference is that most current tools work from a sentiment index of individual words. For instance, finding "like" or "excellent" anywhere in the text will signal a positive evaluation. Your examples depend more on some "understanding" of the phrases, which requires at least minimal parsing. That's a more involved process, requiring a deeper understanding of the language's semantics.

    One way you could attack this is to add indexed phrases to the lexicon alongside individual words, treating each phrase as a single word. Then pre-process the input to convert those phrases into whatever form you've used in the lexicon. For instance, join the words of each phrase with underscores, so that "dark side" becomes "dark_side", which is in your lexicon with a negative index.

    I'm hopeful this gives you a nudge in a useful direction.
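
    The phrase-lexicon idea above could be sketched roughly as follows. This is a minimal illustration, not a tuned solution: the phrases and their scores in PHRASE_SCORES are made-up values, and the helper names join_phrases and score_with_phrases are hypothetical. It assumes the vaderSentiment package, whose SentimentIntensityAnalyzer keeps its word scores in a plain dict called lexicon.

```python
import re

# Hypothetical phrase lexicon: the phrases and their valence scores are
# illustrative guesses (VADER rates words on a scale from -4 to +4).
PHRASE_SCORES = {
    "dark side": -2.0,
    "ever-encroaching invasion": -2.5,
}


def join_phrases(text, phrases):
    """Replace each known phrase with a single underscore-joined token."""
    # Handle longer phrases first so a shorter overlapping phrase
    # cannot break them up.
    for phrase in sorted(phrases, key=len, reverse=True):
        token = phrase.replace(" ", "_")
        pattern = r"\b" + re.escape(phrase) + r"\b"
        text = re.sub(pattern, token, text, flags=re.IGNORECASE)
    return text


def score_with_phrases(text, phrase_scores=PHRASE_SCORES):
    """Score text with VADER after adding the joined phrases to its lexicon."""
    # Imported here so join_phrases works without vaderSentiment installed.
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    # Register each phrase under its underscore-joined form.
    analyzer.lexicon.update(
        {phrase.replace(" ", "_"): score for phrase, score in phrase_scores.items()}
    )
    return analyzer.polarity_scores(join_phrases(text, phrase_scores))
```

    Calling score_with_phrases on one of the headlines from the question should then let the joined token contribute to the 'neg' and 'compound' values. How much it shifts the result depends entirely on the scores you assign, so they would need checking against a sample of headlines you have labelled yourself.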