I'm using tweepy to datamine the public stream of tweets for keywords. This is pretty straightforward and has been described in multiple places:
http://runnable.com/Us9rrMiTWf9bAAW3/how-to-stream-data-from-twitter-with-tweepy-for-python
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Copying code directly from the second link:
#Import the necessary methods from tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
#Variables that contains the user credentials to access Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"
#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print data
        return True

    def on_error(self, status):
        print status

if __name__ == '__main__':

    #This handles Twitter authetification and the connection to Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filter Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])
What I can't figure out is how to stream this data into a Python variable instead of printing it to the screen. I'm working in an IPython notebook and want to capture the stream in some variable, foo, after streaming for a minute or so. Furthermore, how do I get the stream to time out? It runs indefinitely in this manner.
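One common pattern (a sketch only; the class and attribute names here are my own, and in real use the class would subclass tweepy.streaming.StreamListener) is to accumulate tweets on the listener itself and return False from on_data once a time limit has passed, which tells tweepy to disconnect:

```python
import json
import time

class TimedCollector(object):
    """Collect parsed tweets in a list and stop after max_secs seconds.

    In real use, subclass tweepy.streaming.StreamListener instead of object;
    returning False from on_data makes tweepy close the connection.
    """
    def __init__(self, max_secs=60):
        self.tweets = []          # parsed tweet dicts accumulate here
        self.start = time.time()
        self.max_secs = max_secs

    def on_data(self, data):
        self.tweets.append(json.loads(data))
        # Keep streaming only while we are under the time limit.
        return (time.time() - self.start) < self.max_secs

    def on_error(self, status):
        print(status)
        return False              # stop on any error
```

In the notebook, `l = TimedCollector(60)` followed by `Stream(auth, l).filter(track=['python'])` blocks for about a minute, after which `foo = l.tweets` holds the captured tweets.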
Yes, in the post, @Adil Moujahid mentions that his code ran for 3 days. I adapted the same code and, for initial testing, made the following tweaks:
(a) Added a location filter to get a limited set of tweets instead of all tweets containing the keyword. See How to add a location filter to tweepy module. From there, create the stream as in the code above:
stream_all = Stream(auth, l)
Suppose we select the San Francisco area; we can filter on its bounding box:
stream_all.filter(locations=[-122.75,36.8,-121.75,37.8])
Note that filter() blocks until the stream disconnects and returns None, so you cannot chain a second filter() call on its result.
(b) To also restrict by keywords, pass both arguments in a single call (note that the streaming API combines track and locations with an OR, not an AND):
stream_all.filter(locations=[-122.75,36.8,-121.75,37.8], track=['python', 'javascript', 'ruby'])
(c) Since filter() does not return the tweets, collect them in your listener (for example, by appending each tweet to a list attribute such as l.tweets) and write them to a file afterwards:
with open('file_name.json', 'w') as f:
    json.dump(l.tweets, f, indent=1)
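Because the streaming API treats track and locations as an OR rather than an AND, tweets matching either condition come through; if you need tweets that are both inside the box and contain a keyword, you can post-filter the collected tweets client-side. A sketch (the helper names are mine, and it assumes geotagged tweets carry a GeoJSON 'coordinates' field of [longitude, latitude]):

```python
def in_bbox(tweet, sw_lon, sw_lat, ne_lon, ne_lat):
    """True if the tweet has point coordinates inside the bounding box."""
    geo = tweet.get('coordinates') or {}
    coords = geo.get('coordinates')
    if not coords:
        return False  # tweet is not geotagged with a point
    lon, lat = coords
    return sw_lon <= lon <= ne_lon and sw_lat <= lat <= ne_lat

def keyword_and_location(tweets, keywords, bbox):
    """Keep tweets inside bbox whose text mentions any of the keywords."""
    return [t for t in tweets
            if in_bbox(t, *bbox)
            and any(k in t.get('text', '').lower() for k in keywords)]
```

For example, keyword_and_location(tweets, ['python'], (-122.75, 36.8, -121.75, 37.8)) keeps only geotagged Python tweets in the San Francisco box.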
This should take much less time. Coincidentally, I wanted to address the same question you posted today, so I don't have execution times yet.
Hope this helps.