python pandas numpy twitter tweepy

How to stream Tweets by hashtag with language AND count filter using Tweepy?


So what I want to do is live stream Tweets from Twitter's API: just the hashtag 'Brexit', only in English, and only a specific number of Tweets (1k - 2k).

So far my code will live stream the Tweets, but whichever way I modify it, it either ignores the count and streams indefinitely, or I get errors. If I change it to stream only a specific user's Tweets, the count works, but it ignores the hashtag. If I stream everything for the given hashtag, it completely ignores the count. I've had a decent go at trying to fix it, but I'm quite inexperienced and have really hit a brick wall with it.

If I could get some help with how to tick all these boxes at the same time, it would be much appreciated! The code below so far will just stream 'Brexit' Tweets indefinitely, so it ignores the count=10.

The bottom of the code is a bit of a mess due to me playing with it, apologies:

import numpy as np
import pandas as pd
import tweepy
from tweepy import API
from tweepy import Cursor
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import Twitter_Credentials
import matplotlib.pyplot as plt

# Twitter client - hash out to stream all


class TwitterClient:
    def __init__(self, twitter_user=None):
        self.auth = TwitterAuthenticator().authenticate_twitter_app()
        self.twitter_client = API(self.auth)

        self.twitter_user = twitter_user

    def get_twitter_client_api(self):
        return self.twitter_client

# Twitter authenticator


class TwitterAuthenticator:
    def authenticate_twitter_app(self):
        auth = OAuthHandler(Twitter_Credentials.consumer_key, Twitter_Credentials.consumer_secret)
        auth.set_access_token(Twitter_Credentials.access_token, Twitter_Credentials.access_secret)
        return auth

class TwitterStreamer():
    # Class for streaming and processing live Tweets
    def __init__(self):
        self.twitter_authenticator = TwitterAuthenticator()

    def stream_tweets(self, fetched_tweets_filename, hash_tag_list):

        # this handles Twitter authentication and connection to Twitter API
        listener = TwitterListener(fetched_tweets_filename)
        auth = self.twitter_authenticator.authenticate_twitter_app()
        stream = Stream(auth, listener)
        # This line filters Twitter stream to capture data by keywords
        stream.filter(track=hash_tag_list)

# Twitter stream listener

class TwitterListener(StreamListener):
    # This is a listener class that prints incoming Tweets to stdout
    def __init__(self, fetched_tweets_filename):
        self.fetched_tweets_filename = fetched_tweets_filename

    def on_data(self, data):
        try:
            print(data)
            with open(self.fetched_tweets_filename, 'a') as tf:
                tf.write(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        if status == 420:
            # Return false on data in case rate limit occurs
            return False
        print(status)

class TweetAnalyzer():
    # Functionality for analysing and categorising content from tweets

    def tweets_to_data_frame(self, tweets):
        df = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['tweets'])

        df['id'] = np.array([tweet.id for tweet in tweets])
        df['len'] = np.array([len(tweet.text) for tweet in tweets])
        df['date'] = np.array([tweet.created_at for tweet in tweets])
        df['source'] = np.array([tweet.source for tweet in tweets])
        df['likes'] = np.array([tweet.favorite_count for tweet in tweets])
        df['retweets'] = np.array([tweet.retweet_count for tweet in tweets])

        return df


if __name__ == "__main__":

    auth = OAuthHandler(Twitter_Credentials.consumer_key, Twitter_Credentials.consumer_secret)
    auth.set_access_token(Twitter_Credentials.access_token, Twitter_Credentials.access_secret)
    api = tweepy.API(auth)

    for tweet in Cursor(api.search, q="#brexit", count=10,
                               lang="en",
                               since="2019-04-03").items():
        fetched_tweets_filename = "tweets.json"
        twitter_streamer = TwitterStreamer()
        hash_tag_list = ["Brexit"]
        twitter_streamer.stream_tweets(fetched_tweets_filename, hash_tag_list)

Solution

  • You're trying to use two different methods of accessing the Twitter API - Streaming is realtime, and searching is a one-off API call.

    Since streaming is continuous and realtime, there's no way to apply a count of results to it - the code simply opens a connection, says "hey, send me all the Tweets from now onwards that contain the hash_tag_list", and sits listening. At that point you drop into the StreamListener, where each Tweet received is written to a file.

    You could apply a counter here, but you'd need to wrap it inside your StreamListener on_data handler, and increment the counter for each Tweet received. When you get to 1000 Tweets, stop listening.

    For the search option, you have a couple of issues... the first one is that you're asking for Tweets since 2019, but the standard search API can only go back 7 days in time. You've also passed count=10, but with Cursor that sets how many Tweets each page of results contains, not the total; the total is capped by passing a number to .items(). The way you've written the loop though, what's actually happening is that for the first Tweet the search returns, you create a realtime streaming connection and start listening and writing to a file, and since that call never returns, the loop never moves on. So that's not going to work.

    You'll need to choose one - either search for 1000 Tweets and write them to a file (never set up TwitterStreamer()), or listen for 1000 Tweets and write them to a file (drop the for tweet in Cursor(api.search... loop and jump straight to the streamer).
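
To make the two options concrete, here is a minimal sketch of each, assuming tweepy 3.x (the version the question's imports suggest, where StreamListener and api.search still exist). The names CountingListener, stream_limited, search_limited and max_tweets are hypothetical and not part of tweepy. The sketch relies on tweepy 3.x stopping the stream when on_data returns False, on the languages argument of Stream.filter for the English filter, and on .items(n) to cap the total number of search results. It also assumes that Tweet payloads carry a "text" field, so other stream messages (such as limit notices) are not counted.

```python
import json

try:
    # tweepy 3.x, the version the question's imports imply
    from tweepy.streaming import StreamListener
except ImportError:
    # Fallback so the counting logic can be tried without tweepy 3.x installed
    StreamListener = object


class CountingListener(StreamListener):
    """Hypothetical listener: appends Tweets to a file, stops after max_tweets."""

    def __init__(self, fetched_tweets_filename, max_tweets=1000):
        super().__init__()
        self.fetched_tweets_filename = fetched_tweets_filename
        self.max_tweets = max_tweets
        self.tweet_count = 0

    def on_data(self, data):
        try:
            message = json.loads(data)
        except ValueError:
            return True  # not valid JSON: skip it and keep listening
        if "text" not in message:
            return True  # assumed to be a non-Tweet message, so not counted
        with open(self.fetched_tweets_filename, "a") as tf:
            tf.write(data)
        self.tweet_count += 1
        # Returning False tells tweepy 3.x to stop the stream
        return self.tweet_count < self.max_tweets

    def on_error(self, status):
        if status == 420:
            return False  # rate limited: disconnect rather than retry
        print(status)
        return True


def stream_limited(auth, hash_tag_list, fetched_tweets_filename, max_tweets=1000):
    """Option 1: listen until max_tweets English Tweets have been written."""
    from tweepy import Stream  # tweepy 3.x API

    listener = CountingListener(fetched_tweets_filename, max_tweets)
    stream = Stream(auth, listener)
    # Blocks until the listener returns False
    stream.filter(track=hash_tag_list, languages=["en"])
    return listener.tweet_count


def search_limited(api, query, fetched_tweets_filename, max_tweets=1000):
    """Option 2: one-off search over roughly the last 7 days, no streaming."""
    from tweepy import Cursor

    saved = 0
    with open(fetched_tweets_filename, "a") as tf:
        # count=100 is the page size; items(max_tweets) caps the total
        cursor = Cursor(api.search, q=query, lang="en", count=100)
        for tweet in cursor.items(max_tweets):
            tf.write(json.dumps(tweet._json) + "\n")
            saved += 1
    return saved
```

With the auth object built as in the question, the streaming option would be called as stream_limited(auth, ["Brexit"], "tweets.json", max_tweets=1000), and the search option as search_limited(api, "#brexit", "tweets.json", max_tweets=1000). Only one of the two should be used in a given run.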