
Harvesting Twitter data


I have a script that harvests Twitter data for IDs stored in an XML file, but it does not fetch everything. After some time it just gets empty messages. From 2000 IDs I managed to save only about 200 tweets. Any idea how to fix this?

import xml.etree.ElementTree as xml
import urllib2
import sys

startIter = int(sys.argv[1])
stopIter = int(sys.argv[2])

#Open file to write JSON to
jsonFile = open('jSONfile', 'a')
#Parse XML directly from the file path
tree = xml.parse("twitter.xml")

#Get the root node
rootElement = tree.getroot()

#Loop through nodes in root
iterator = 1
for node in rootElement:
    if iterator >= startIter and iterator <= stopIter:
        print iterator
        print node[0].text
        nodeID = node[0].text
        try:
            tweet = urllib2.urlopen('https://api.twitter.com/1/statuses/show.json?id={0}&include_entities=true'.format(nodeID))
            tweetData = tweet.read()
            print tweetData
            # Write one tweet per line, followed by a comma separator
            jsonFile.write('{0},\n'.format(tweetData))
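        except urllib2.URLError as e:
            # Report the failure (for example a rate-limit response) instead of silently skipping it
            print 'Request for id {0} failed: {1}'.format(nodeID, e)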
    iterator = iterator + 1
jsonFile.close() 

Solution

  • Twitter's APIs are strictly rate limited. If you hit them too frequently, Twitter is very likely to stop serving you content, either for a fixed period of time or permanently. For the exact limits, check their "API Rate Limiting" and "Rate limits" documentation pages.

    Also, Twitter themselves admit that, given the amount of data they have to deal with, their normal APIs serve only about 1% of the actual data coming in. If you want the entire data set for your particular API type, you need access to the Twitter Firehose APIs.
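
    One way to cope with the throttling described above is to back off and retry instead of silently skipping a failed request. The following is only a minimal sketch in Python 3, not a definitive fix. The names fetch_all, fetch, and throttle_codes are hypothetical. Treating HTTP 429 as the throttling signal is an assumption; the status code Twitter actually returns depends on the API version, so check the rate-limit documentation for the endpoint you use.

```python
import time
import urllib.error


def fetch_all(ids, fetch, max_retries=5, base_delay=1.0,
              throttle_codes=(429,), sleep=time.sleep):
    """Fetch every id, waiting and retrying when the server throttles us.

    `fetch` is a hypothetical callable that takes an id and returns the
    response body, raising urllib.error.HTTPError on an HTTP error.
    Returns (results, failed): a dict of id -> body and a list of ids
    that could not be fetched.
    """
    results = {}
    failed = []
    for tweet_id in ids:
        delay = base_delay
        for _ in range(max_retries):
            try:
                results[tweet_id] = fetch(tweet_id)
                break
            except urllib.error.HTTPError as err:
                # Assumption: only codes in throttle_codes mean "slow down".
                # Anything else (e.g. 404 for a deleted tweet) is not retried.
                if err.code not in throttle_codes:
                    failed.append(tweet_id)
                    break
                sleep(delay)   # wait before the next attempt
                delay *= 2     # exponential backoff
        else:
            # Every attempt was throttled; record the id so it can be retried later.
            failed.append(tweet_id)
    return results, failed
```

    In a real script, fetch could be a small wrapper around urllib.request.urlopen that reads and returns the response body. Keeping the list of failed IDs means a later run can retry just those, rather than losing them the way the bare except in the question does.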