I have a script that harvests Twitter data according to IDs stored in an XML file, but it does not fetch everything. After some time it just gets empty messages; from 2000 IDs I managed to save only ~200 tweets. Any idea how to fix this?
import xml.etree.ElementTree as xml
import urllib2
import sys

startIter = int(sys.argv[1])
stopIter = int(sys.argv[2])

# Open file to append JSON to
jsonFile = open('jSONfile', 'a')

# Parse XML directly from the file path
tree = xml.parse("twitter.xml")

# Get the root node
rootElement = tree.getroot()

# Loop through nodes in root, fetching the tweet for each ID
iterator = 1
for node in rootElement:
    if iterator >= startIter and iterator <= stopIter:
        print iterator
        print node[0].text
        nodeID = node[0].text
        try:
            tweet = urllib2.urlopen('https://api.twitter.com/1/statuses/show.json?id={0}&include_entities=true'.format(nodeID))
            tweetData = tweet.read()
            print tweetData
            jsonFile.write('{0},\n'.format(tweetData))
        except urllib2.HTTPError as e:
            # Don't swallow errors silently; a 4xx here is the clue
            print 'Failed to fetch tweet {0}: {1}'.format(nodeID, e)
    iterator = iterator + 1

jsonFile.close()
The Twitter APIs enforce strict rate limits: they throttle requests, and if you hit their endpoints too frequently it is very likely they will stop serving content to you, either permanently or for a fixed period of time. To get an idea of what exactly the limits are, check their API Rate Limiting and Rate Limits documentation.
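One common way to cope with throttling is to back off and retry instead of hammering the endpoint. Below is a minimal, hedged sketch: `fetch_with_backoff` and its `fetch` parameter are hypothetical names standing in for your `urllib2.urlopen` call, and the sketch treats an empty/None response as a throttling signal, which matches the symptom you describe but is an assumption, not Twitter's documented contract.

```python
import time

def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    # `fetch` is a hypothetical zero-argument callable that wraps
    # the actual HTTP request and returns the payload, or None/''
    # when the API serves an empty (throttled) response.
    delay = base_delay
    for attempt in range(max_retries):
        result = fetch()
        if result:
            return result  # got a real payload
        # Assume we were throttled: wait, then retry with a longer delay.
        time.sleep(delay)
        delay *= 2  # exponential backoff
    return None  # gave up after max_retries attempts
```

In your loop you would wrap each tweet request in `fetch_with_backoff` so that a burst of empty responses pauses the script instead of silently writing nothing.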
Also, Twitter themselves admit that, given the volume of data they handle, their normal APIs serve only about 1% of the actual data coming in. If you want the entire data set for your particular API type, you need access to the Twitter Firehose API.