
Tweepy: Stream data for X minutes?


I'm using tweepy to mine the public stream of tweets for keywords. This is fairly straightforward and has been described in multiple places:

http://runnable.com/Us9rrMiTWf9bAAW3/how-to-stream-data-from-twitter-with-tweepy-for-python

http://adilmoujahid.com/posts/2014/07/twitter-analytics/

Copying code directly from the second link:

# Import the necessary classes from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

# Variables that contain the user credentials to access the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"


#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)


if __name__ == '__main__':

    # This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    # This line filters the Twitter stream to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])

What I can't figure out is how to stream this data into a Python variable instead of printing it to the screen. I'm working in an IPython notebook and want to capture the stream in some variable, foo, after streaming for a minute or so. Furthermore, how do I get the stream to time out? As written, it runs indefinitely.

Related:

Using tweepy to access Twitter's Streaming API


Solution

  • Yes, in the post, @Adil Moujahid mentions that his code ran for 3 days. I adapted the same code and, for initial testing, made the following tweaks:

    (a) Added a location filter to receive a limited set of tweets instead of every tweet worldwide that contains a keyword. See How to add a location filter to tweepy module. To do this, first keep the stream object from the code above in its own variable:

    stream_all = Stream(auth, l)
    

    Suppose we select the San Francisco area. Its bounding box (south-west longitude and latitude first, then north-east) is:

    sfo_box = [-122.75, 36.8, -121.75, 37.8]
    

    (b) Then start the stream. In tweepy 3.x, filter() blocks and returns None, so a second filter() call cannot be chained onto its result; the tweets are delivered to the listener's on_data instead. Twitter also combines track and locations with OR rather than AND, so stream by location only and check for the keywords ('python', 'javascript', 'ruby') inside on_data:

    stream_all.filter(locations=sfo_box)
    

    (c) You can then write the collected tweets to a file. Here tweets is assumed to be a list to which on_data appends each parsed tweet:

    import json

    with open('file_name.json', 'w') as f:
        json.dump(tweets, f, indent=1)
    

    This should take much less time. Coincidentally, I wanted to solve the same problem today, so I don't have execution times yet.
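
A minimal sketch of narrowing tweets by both keyword and bounding box on the client side, after they arrive from the stream. It assumes each message is a raw JSON string in the Twitter v1.1 tweet format, where `coordinates` is either null or a GeoJSON point given as [longitude, latitude]. The names `SFO_BOX`, `KEYWORDS`, `in_box`, `mentions` and `keep` are made up for this illustration and are not part of tweepy.

```python
import json

# Bounding box as [west_lon, south_lat, east_lon, north_lat] (San Francisco area).
SFO_BOX = [-122.75, 36.8, -121.75, 37.8]
KEYWORDS = ['python', 'javascript', 'ruby']


def in_box(tweet, box):
    """Return True if the tweet carries an exact point that lies inside the box."""
    point = tweet.get('coordinates')  # GeoJSON point, or None when not geotagged
    if not point:
        return False
    lon, lat = point['coordinates']
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north


def mentions(tweet, keywords):
    """Return True if any keyword appears in the tweet text (case-insensitive)."""
    text = tweet.get('text', '').lower()
    return any(word in text for word in keywords)


def keep(raw, box=SFO_BOX, keywords=KEYWORDS):
    """Parse one raw JSON message and require both the location and keyword match."""
    tweet = json.loads(raw)
    return in_box(tweet, box) and mentions(tweet, keywords)
```

A listener's on_data could call `keep(data)` and store only the messages for which it returns True. Tweets that are tagged with a place but no exact point are dropped by this sketch; handling them would mean checking the place's own bounding box as well.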

    Hope this helps.
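
To address the original question directly (capturing the stream in a variable and stopping after about a minute), here is a minimal sketch that assumes tweepy 3.x, where the stream disconnects as soon as on_data returns False. `TimedListener`, `time_limit` and `tweets` are names invented for this example. The fallback base class is only there so the class can be tried without tweepy 3.x installed.

```python
import json
import time

try:
    # tweepy 3.x, as used in the question
    from tweepy.streaming import StreamListener
except ImportError:
    # Stand-in base class so the sketch can be tried without tweepy 3.x.
    StreamListener = object


class TimedListener(StreamListener):
    """Collect parsed tweets in a list and stop after time_limit seconds."""

    def __init__(self, time_limit=60):
        super(TimedListener, self).__init__()
        self.time_limit = time_limit
        # The clock starts when the listener is created, not when the stream connects.
        self.start_time = time.time()
        self.tweets = []

    def on_data(self, data):
        if time.time() - self.start_time >= self.time_limit:
            return False  # in tweepy 3.x, returning False ends the stream
        self.tweets.append(json.loads(data))
        return True

    def on_error(self, status):
        print(status)
        return False  # stop on errors instead of retrying


# Usage with the objects from the question (needs real credentials):
#   l = TimedListener(time_limit=60)
#   stream = Stream(auth, l)
#   stream.filter(track=['python', 'javascript', 'ruby'])  # returns after the limit
#   foo = l.tweets
```

The time check only runs when a message arrives, so on a very quiet stream the call may return somewhat later than the limit.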