I'm trying to make a GUI app where you enter the Twitter hashtags of two different things and it compares them using sentiment analysis (I'm using movies as an example right now). My code isn't finished yet, as I only have one hashtag working so far. The end result is supposed to be a graph that shows the polarity of the tweets (so far it only shows the polarity of one movie). My code works and will pop up a graph, but most of the time it takes FOREVER. Sometimes it loads quickly like I expect, but other times it takes so long that I get impatient and re-run the program. Is the way the code is arranged/the modules being used causing this? Or is sentiment analysis generally very slow? This is my first sentiment analysis project, so I'm not really sure. Here is my code; I've taken out the Twitter keys and tokens as I'm not sure I can leave those in:
import tweepy as tw
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
consumer_key = ''
consumer_secret = ''
access_token = ''
access_token_secret = ''
# authenticate twitter
auth = tw.OAuthHandler(consumer_key,consumer_secret)
auth.set_access_token(access_token,access_token_secret)
api = tw.API(auth, wait_on_rate_limit=True)
# GET TWEETS HERE
hashtag = "#GreenKnight"  # pass the query as a plain string, not a one-element tuple
query = tw.Cursor(api.search, q=hashtag).items(1000)
tweets = [{'Tweets': tweet.text, 'Timestamp': tweet.created_at} for tweet in query]
# put tweets in pandas dataframe
df = pd.DataFrame.from_dict(tweets)
df.head()
# green knight movie references
green_knight_references = ["GreenKnight", "Green Knight", "green knight", "greenknight", "'The Green Knight'"]
def identify_subject(tweet, refs):
    flag = 0
    for ref in refs:
        if tweet.find(ref) != -1:
            flag = 1
    return flag
df['Green Knight'] = df['Tweets'].apply(lambda x: identify_subject(x, green_knight_references))
df.head(10)
# time for stop words, to clear out the language not needed
import nltk
from nltk.corpus import stopwords
from textblob import Word, TextBlob
stop_words = stopwords.words("english")
custom_stopwords = ['RT']
import re
def preprocess_tweets(tweet, custom_stopwords):
    # str.replace does not understand regex patterns and its result was
    # being discarded; use re.sub to actually strip punctuation
    preprocessed_tweet = re.sub(r'[^\w\s]', '', tweet)
    preprocessed_tweet = " ".join(word for word in preprocessed_tweet.split() if word not in stop_words)
    preprocessed_tweet = " ".join(word for word in preprocessed_tweet.split() if word not in custom_stopwords)
    preprocessed_tweet = " ".join(Word(word).lemmatize() for word in preprocessed_tweet.split())
    return preprocessed_tweet
df['Processed Tweet'] = df['Tweets'].apply(lambda x: preprocess_tweets(x, custom_stopwords))
df.head()
#visualize
df['polarity'] = df['Processed Tweet'].apply(lambda x: TextBlob(x).sentiment[0])
df['subjectivity'] = df['Processed Tweet'].apply(lambda x: TextBlob(x).sentiment[1])
df.head()
(df[df['Green Knight']==1][['Green Knight','polarity','subjectivity']].groupby('Green Knight').agg([np.mean, np.max, np.min, np.median]))
green_knight = df[df['Green Knight']==1][['Timestamp', 'polarity']]
green_knight = green_knight.sort_values(by='Timestamp', ascending=True)
green_knight['MA Polarity'] = green_knight.polarity.rolling(10, min_periods=3).mean()
green_knight.head()
fig, axes = plt.subplots(2, 1, figsize=(13, 10))
axes[0].plot(green_knight['Timestamp'], green_knight['MA Polarity'])
axes[0].set_title("Green Knight Tweets")
fig.suptitle("Movie tweet polarity", y=0.98)
plt.show()
I've worked with tweepy before, and the single slowest thing was Twitter's API. The rate limit gets exhausted extremely quickly, and without paying them it's going to be frustrating :( . Also note that you construct the API with wait_on_rate_limit=True, so whenever the limit is hit, tweepy silently sleeps until the 15-minute rate-limit window resets; that would explain why the script sometimes finishes quickly but appears to hang after a few re-runs in a row.
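As a rough sanity check, here is the arithmetic; the limits below are my assumptions based on the standard v1.1 search API (180 requests per 15-minute window, 15 results per request by default, up to 100 with count=100), so double-check them against Twitter's docs:

```python
import math

TWEETS_WANTED = 1000
RATE_LIMIT = 180        # assumed: requests allowed per 15-minute window
DEFAULT_PAGE_SIZE = 15  # assumed: results per request when count is not set
MAX_PAGE_SIZE = 100     # assumed: results per request with count=100

default_requests = math.ceil(TWEETS_WANTED / DEFAULT_PAGE_SIZE)
max_requests = math.ceil(TWEETS_WANTED / MAX_PAGE_SIZE)

print(f"default page size: {default_requests} requests per run")  # 67
print(f"count=100:         {max_requests} requests per run")      # 10
print(f"runs before the window is exhausted: "
      f"{RATE_LIMIT // default_requests} vs {RATE_LIMIT // max_requests}")
```

So at the default page size, a couple of impatient re-runs are enough to empty the window, after which every run stalls until the window resets.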
The sentiment analysis using TextBlob shouldn't be slow. However, your best bet is to use cProfile as @osint_alex mentioned in the comment, or, for a simple solution, just put some print statements between the main 'blocks' of code.
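For example, here is a minimal sketch of both approaches; fetch_tweets and analyse are hypothetical stand-ins for your actual blocks (the sleep simulates the network call), not functions from your code:

```python
import cProfile
import io
import pstats
import time

def fetch_tweets():
    # stand-in for the slow Cursor/api.search call
    time.sleep(0.2)
    return ["tweet"] * 10

def analyse(tweets):
    # stand-in for the preprocessing + TextBlob step
    return [len(t) for t in tweets]

# Option 1: coarse timing with prints between the main blocks
t0 = time.perf_counter()
tweets = fetch_tweets()
print(f"fetch took {time.perf_counter() - t0:.2f}s")

t0 = time.perf_counter()
scores = analyse(tweets)
print(f"analysis took {time.perf_counter() - t0:.2f}s")

# Option 2: profile everything and sort by cumulative time
profiler = cProfile.Profile()
profiler.enable()
fetch_tweets()
analyse(tweets)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```

If the profile shows almost all of the cumulative time inside the fetch step (in your case, the tweepy/api.search calls), the API is the bottleneck rather than TextBlob.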