Tags: python, twitter, twitter-streaming-api

Python API Streaming, write new file after certain size


I have a Python script that maintains an open connection to the Twitter Streaming API and writes the incoming data to a JSON file. Is it possible to start writing to a new file, without breaking the connection, once the current file reaches a certain size? For example, I just streamed data for over a week, but it all landed in a single file (~2 GB), which makes it slow to parse. If I could roll over to a new file after, say, 500 MB, I would have four smaller files (e.g. dump1.json, dump2.json, etc.) to parse instead of one large one.

import tweepy
from tweepy import OAuthHandler
from tweepy import Stream
from tweepy.streaming import StreamListener

# Add consumer/access tokens for Twitter API
consumer_key = '-----'
consumer_secret = '-----'
access_token = '-----'
access_secret = '-----'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

# Define a StreamListener subclass to open a connection to Twitter and consume data
class MyListener(StreamListener):
    def on_data(self, data):
        try:
            # Raw string avoids backslash-escape problems in the Windows path
            with open(r'G:\xxxx\Raw_tweets.json', 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
            return True

    def on_error(self, status):
        print(status)
        return True

bounding_box = [-77.2157, 38.2036, -76.5215, 39.3365]  # filtering by location
keyword_list = ['']  # filtering by keyword

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(locations=bounding_box) # Filter Tweets in stream by location bounding box
#twitter_stream.filter(track=keyword_list) # Filter Tweets in stream by keyword

Solution

  • Since you re-open the file on every call to on_data, this is straightforward: embed an index in the file name and advance it whenever the current file's size reaches the threshold.

    import os

    class MyListener(StreamListener):
        def __init__(self):  # was misspelled __init; also call the parent initializer
            super().__init__()
            self._file_index = 0

        def on_data(self, data):
            tweets_file = r'G:\xxxx\Raw_tweets{}.json'.format(self._file_index)
            # Skip past any file that has already reached the size limit (500 MB here)
            while os.path.exists(tweets_file) and os.stat(tweets_file).st_size > 500 * 2**20:
                self._file_index += 1
                tweets_file = r'G:\xxxx\Raw_tweets{}.json'.format(self._file_index)
            # ... write data to tweets_file as before
    

    The while loop also takes care of your app being restarted: on startup it skips over every file that has already reached the size limit and resumes at the first one with room left.