Tags: python, pandas, twitter, sentiment-analysis

Why is my sentiment analysis running so slow?


I'm building a GUI app where you enter the Twitter hashtags of two different things and it compares them using sentiment analysis (I'm using movies as an example). My code isn't finished yet; only one hashtag is handled so far. The end result is supposed to be a graph showing the polarity of tweets for both movies (currently it only shows the polarity of one). The code works and pops up a graph, but most of the time it takes forever. Occasionally it loads quickly, as I expect, but otherwise it takes so long that I get impatient and re-run the program. Is the way the code is arranged, or the modules I'm using, causing this? Or is sentiment analysis generally slow? This is my first sentiment analysis project, so I'm not really sure. Here is my code; I've removed the Twitter keys and tokens since I'm not sure I can share them:

import tweepy as tw
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''

# authenticate twitter
auth = tw.OAuthHandler(consumer_key,consumer_secret)
auth.set_access_token(access_token,access_token_secret)
api = tw.API(auth,wait_on_rate_limit= True)

# GET TWEETS HERE

hashtag = "#GreenKnight"  # pass a plain string, not a tuple, as the query
query = tw.Cursor(api.search, q=hashtag).items(1000)
tweets = [{'Tweets': tweet.text, 'Timestamp': tweet.created_at} for tweet in query]
# put tweets in pandas dataframe
df = pd.DataFrame.from_dict(tweets)
df.head()

# green knight movie references
green_knight_references = ["GreenKnight", "Green Knight", "green knight", "greenknight", "'The Green Knight'"]
def identify_subject(tweet, refs):
    # Return 1 if the tweet mentions any reference string, else 0.
    # (The original returned inside the loop, so only the first ref was checked.)
    for ref in refs:
        if tweet.find(ref) != -1:
            return 1
    return 0

df['Green Knight'] = df['Tweets'].apply(lambda x: identify_subject(x, green_knight_references))

df.head(10)

# time for stop words, to clear out the language not needed
import nltk
from nltk.corpus import stopwords
from textblob import Word, TextBlob
# nltk.download('stopwords')  # uncomment if the corpus isn't downloaded yet
stop_words = stopwords.words("english")
custom_stopwords = ['RT']

import re

def preprocess_tweets(tweet, custom_stopwords):
    # str.replace does not take a regex and its result was never assigned,
    # so the original punctuation-stripping line was a no-op; use re.sub.
    preprocessed_tweet = re.sub(r'[^\w\s]', '', tweet)
    preprocessed_tweet = " ".join(word for word in preprocessed_tweet.split() if word not in stop_words)
    preprocessed_tweet = " ".join(word for word in preprocessed_tweet.split() if word not in custom_stopwords)
    preprocessed_tweet = " ".join(Word(word).lemmatize() for word in preprocessed_tweet.split())
    return preprocessed_tweet


df['Processed Tweet'] = df['Tweets'].apply(lambda x: preprocess_tweets(x, custom_stopwords))
df.head()

#visualize

# Parse each tweet with TextBlob once instead of twice
sentiments = df['Processed Tweet'].apply(lambda x: TextBlob(x).sentiment)
df['polarity'] = sentiments.apply(lambda s: s[0])
df['subjectivity'] = sentiments.apply(lambda s: s[1])
df.head()
(df[df['Green Knight']==1][['Green Knight','polarity','subjectivity']].groupby('Green Knight').agg(['mean', 'max', 'min', 'median']))


green_knight = df[df['Green Knight']==1][['Timestamp', 'polarity']]
green_knight = green_knight.sort_values(by='Timestamp', ascending=True)
green_knight['MA Polarity'] = green_knight.polarity.rolling(10, min_periods=3).mean()

green_knight.head()

fig, axes = plt.subplots(2, 1, figsize=(13, 10))

axes[0].plot(green_knight['Timestamp'], green_knight['MA Polarity'])
axes[0].set_title("Green Knight Tweets")


fig.suptitle("Movie tweet polarity", y=0.98)

plt.show()

Solution

  • I've worked with tweepy before, and the single slowest thing was Twitter's API itself. The rate limits get exhausted extremely quickly, and because you pass wait_on_rate_limit=True, tweepy silently sleeps until the limit resets, which can easily look like the program is hanging. Without paying for elevated API access, it's going to be frustrating :( .
    The sentiment analysis using TextBlob shouldn't be slow. Your best bet is to profile with cProfile, as @osint_alex mentioned in the comments, or for a simpler check just put some timing print statements between the main 'blocks' of code.
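    A minimal sketch of that timing approach (the timed helper and the stand-in blocks below are mine, not from the original code; substitute your actual tweet-fetching and preprocessing steps):

    ```python
    import cProfile
    import pstats
    import time

    def timed(label, func, *args, **kwargs):
        """Run func, print how long it took, and return its result."""
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{label}: {time.perf_counter() - start:.2f}s")
        return result

    # Hypothetical stand-ins for the real blocks (API call, preprocessing):
    tweets = timed("fetch tweets", lambda: ["Great movie!", "Terrible plot."])
    cleaned = timed("preprocess", lambda: [t.lower() for t in tweets])

    # Alternatively, profile a block and print the five slowest calls:
    profiler = cProfile.Profile()
    profiler.enable()
    _ = sorted(cleaned)  # stand-in for the sentiment-analysis step
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
    ```

    If the "fetch tweets" block dominates, the bottleneck is the API (likely rate-limit sleeps), not TextBlob.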