Tags: python, curl, urllib2, pycurl

Get many pages with pycurl?


I want to get many pages from a website, like

curl "http://farmsubsidy.org/DE/browse?page=[0000-3603]" -o "de.#1"

but I want the pages' data in Python, not in disk files. Can someone post pycurl code to do this,
or a fast urllib2 approach (not one-at-a-time) if that's possible,
or else just say "forget it, curl is faster and more robust"? Thanks.


Solution

  • Here is a solution based on urllib2 and threads.

    import urllib2
    from threading import Thread
    
    BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
    NUM_RANGE = range(0, 3604)  # pages 0000-3603 inclusive, matching the curl glob
    THREADS = 2
    
    def main():
        # hand each thread its own slice of the page numbers
        threads = []
        for nums in split_seq(NUM_RANGE, THREADS):
            t = Spider(BASE_URL, nums)
            t.start()
            threads.append(t)
        # wait until every thread has finished downloading
        for t in threads:
            t.join()
    
    def split_seq(seq, num_pieces):
        # split seq into num_pieces contiguous, roughly equal chunks
        start = 0
        for i in xrange(num_pieces):
            stop = start + len(seq[i::num_pieces])
            yield seq[start:stop]
            start = stop
    
    class Spider(Thread):
        def __init__(self, base_url, nums):
            Thread.__init__(self)
            self.base_url = base_url
            self.nums = nums
        def run(self):
            # fetch each page in this thread's chunk; the data stays in memory
            for num in self.nums:
                url = '%s%s' % (self.base_url, num)
                data = urllib2.urlopen(url).read()
                print data
    
    if __name__ == '__main__':
        main()
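
  • Since the question asks about pycurl specifically, here is a minimal, untested sketch of the same job using pycurl's CurlMulti interface, which drives many transfers concurrently from a single thread. The fetch_all helper and the small range(0, 10) batch are illustrative assumptions, not the answer's code; for the full 0-3603 range you would likely want to feed handles to the multi object in batches rather than register all of them at once.

    import pycurl
    from cStringIO import StringIO
    
    BASE_URL = 'http://farmsubsidy.org/DE/browse?page='
    
    def fetch_all(urls):
        # one easy handle plus an in-memory buffer per URL, all driven by one CurlMulti
        multi = pycurl.CurlMulti()
        handles = []
        for url in urls:
            buf = StringIO()
            c = pycurl.Curl()
            c.setopt(pycurl.URL, url)
            c.setopt(pycurl.WRITEFUNCTION, buf.write)
            c.buf = buf                    # keep the buffer reachable from its handle
            multi.add_handle(c)
            handles.append(c)
    
        # pump the multi handle until every transfer has completed
        num_active = len(handles)
        while num_active:
            while True:
                ret, num_active = multi.perform()
                if ret != pycurl.E_CALL_MULTI_PERFORM:
                    break
            multi.select(1.0)              # wait for socket activity instead of busy-looping
    
        # collect each page's data and release the handles
        results = []
        for c in handles:
            results.append(c.buf.getvalue())
            multi.remove_handle(c)
            c.close()
        return results
    
    if __name__ == '__main__':
        pages = fetch_all('%s%s' % (BASE_URL, n) for n in range(0, 10))
        print 'fetched %d pages' % len(pages)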