python performance data-manipulation sentiment-analysis vader

Is there a way to improve the performance of the nltk.sentiment.vader sentiment analyser?


My text is derived from a social network, so you can imagine its nature. I think the text is as clean and minimal as I could make it after performing the following sanitization:

  • no urls, no usernames
  • no punctuation, no accents
  • no numbers
  • no stopwords (I think vader does this anyway)

I think the run time is linear, and I don't intend to do any parallelization because of the amount of effort needed to change the available code. For example, for around 1000 texts ranging from ~50 KB to ~150 KB each, the running time is around 10 minutes on my machine.

Is there a better way of feeding the algorithm to speed up cooking time? The code is as simple as SentimentIntensityAnalyzer is intended to work; here is the main part:

import gc

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

# the same parameter s is passed for both %s placeholders
c.execute(
    "select body, creation_date, group_id from posts "
    "where (substring(lower(body) from (%s))=(%s)) and language='en' "
    "order by creation_date DESC",
    (s, s),
)
conn.commit()
if c.rowcount > 0:
    dump_fetched = c.fetchall()

textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
del dump_fetched
gc.collect()
texts = textsSql['body'].values
# here, some data manipulation: steps listed above
polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]

Solution

    1. You need not remove the stopwords; nltk+vader already does that.

    2. You should not remove the punctuation: it affects vader's polarity calculations, and stripping it adds processing overhead as well. So keep the punctuation.

        >>> txt = "this is superb!"
        >>> s.polarity_scores(txt)
        {'neg': 0.0, 'neu': 0.313, 'pos': 0.687, 'compound': 0.6588}
        >>> txt = "this is superb"
        >>> s.polarity_scores(txt)
        {'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}

    3. You should also introduce sentence tokenization, as it would improve accuracy, and then calculate the average polarity of a paragraph from its sentences. Example here: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517

    4. The polarity calculations are completely independent of each other, so a multiprocessing pool of small size, say 10, can provide a good boost in speed for this line:

        polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
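
Points 3 and 4 could be combined roughly as follows. This is only a sketch under a few assumptions: the vader lexicon is already installed for nltk, punctuation has been kept so sentences can be split, and the regex splitter below is a naive stand-in that you may prefer to replace with a proper sentence tokenizer. The names `split_sentences`, `mean_compound` and `score_texts` are made up for illustration, and the `chunksize` value is a guess to tune, not a recommendation.

```python
import re
from multiprocessing import Pool
from statistics import mean

_sid = None  # per-process analyzer, set by _init_worker


def _init_worker():
    """Build one SentimentIntensityAnalyzer per worker process."""
    global _sid
    # Imported inside the initializer; the analyzer (and its lexicon)
    # is then constructed once per worker instead of once per text.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    _sid = SentimentIntensityAnalyzer()


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def mean_compound(text, analyzer=None):
    """Average the compound score over the sentences of one text."""
    if analyzer is None:
        analyzer = _sid  # the worker's own analyzer
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return mean(analyzer.polarity_scores(s)["compound"] for s in sentences)


def score_texts(texts, processes=10, chunksize=16):
    """Score texts in parallel; results keep the order of `texts`."""
    with Pool(processes=processes, initializer=_init_worker) as pool:
        return pool.map(mean_compound, list(texts), chunksize=chunksize)
```

With this in place, the list comprehension would become `polarity_ = score_texts(texts)`. On platforms that start workers by spawning a fresh interpreter (Windows, recent macOS), that call has to sit under an `if __name__ == "__main__":` guard.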