My text is derived from a social network, so you can imagine its nature. I think the text is as clean and minimal as I could make it after performing the following sanitization:
I think the run time is linear, and I don't intend to do any parallelization because of the amount of effort needed to change the available code. For example, for around 1000 texts ranging from ~50 KB to ~150 KB, it takes around
and the running time is around 10 minutes on my machine.
Is there a better way of feeding the algorithm to speed up the processing time? The code is as simple as SentimentIntensityAnalyzer is intended to be used; here is the main part:
import gc
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
# Pass the query parameters as a separate tuple, not inside the SQL string
c.execute("select body, creation_date, group_id from posts where (substring(lower(body) from (%s))=(%s)) and language='en' order by creation_date DESC", (s, s,))
conn.commit()
if c.rowcount > 0:
    dump_fetched = c.fetchall()
    textsSql = pd.DataFrame(dump_fetched, columns=['body', 'created_at', 'group_id'])
    del dump_fetched
    gc.collect()
    texts = textsSql['body'].values
    # here, some data manipulation: steps listed above
    polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
1. You need not remove the stopwords; nltk+vader already does that.
2. You need not remove the punctuation either: doing so changes vader's polarity calculations, on top of the processing overhead. So keep the punctuation in.
>>> txt = "this is superb!"
>>> s.polarity_scores(txt)
{'neg': 0.0, 'neu': 0.313, 'pos': 0.687, 'compound': 0.6588}
>>> txt = "this is superb"
>>> s.polarity_scores(txt)
{'neg': 0.0, 'neu': 0.328, 'pos': 0.672, 'compound': 0.6249}
3. You should introduce sentence tokenization too, as it would improve the accuracy, and then calculate the average polarity for a paragraph based on its sentences. Example here: https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517
4. The polarity calculations are completely independent of each other, so a small multiprocessing pool, say of size 10, can provide a good boost in speed for this line:
polarity_ = [sid.polarity_scores(s)['compound'] for s in texts]
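Points 3 and 4 could be combined roughly as follows. This is only a sketch under some assumptions: the names `paragraph_compound`, `score_texts`, `_init_worker`, `_split_sentences` and `_compound` are made up for illustration, the `chunksize` of 16 is an arbitrary starting value, and it assumes the NLTK data for VADER and for the sentence tokenizer is already downloaded. The actual speed-up depends on your machine and text sizes, so it is worth timing on a sample first.

```python
from multiprocessing import Pool
from statistics import mean

_analyzer = None  # one SentimentIntensityAnalyzer per worker process


def _init_worker():
    """Build the analyzer once per process instead of once per text."""
    global _analyzer
    # Imported here so that each worker process loads NLTK on its own.
    from nltk.sentiment.vader import SentimentIntensityAnalyzer
    _analyzer = SentimentIntensityAnalyzer()


def _split_sentences(text):
    """Split a text into sentences with NLTK's sentence tokenizer."""
    from nltk.tokenize import sent_tokenize
    return sent_tokenize(text)


def _compound(sentence):
    """Return VADER's compound score for one sentence."""
    return _analyzer.polarity_scores(sentence)['compound']


def paragraph_compound(text, split=_split_sentences, score=_compound):
    """Average the per-sentence compound scores; 0.0 if there are no sentences."""
    sentences = [s for s in split(text) if s.strip()]
    return mean(score(s) for s in sentences) if sentences else 0.0


def score_texts(texts, workers=10, chunksize=16):
    """Score every text, keeping input order; workers <= 1 runs serially."""
    texts = list(texts)
    if workers <= 1:
        _init_worker()
        return [paragraph_compound(t) for t in texts]
    with Pool(processes=workers, initializer=_init_worker) as pool:
        # chunksize batches several texts per task to cut inter-process overhead
        return pool.map(paragraph_compound, texts, chunksize=chunksize)
```

In the question's code this would replace the list comprehension with something like `polarity_ = score_texts(texts, workers=10)`. On platforms that start worker processes by spawning a fresh interpreter, that call should sit under an `if __name__ == "__main__":` guard.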