python, python-2.7, urllib2

Downloading a sequence of webpages using Python


I am very new to Python [running 2.7.x] and I am trying to download content from a webpage with thousands of links. Here's my code:

import urllib2
i = 1
limit = 1441

for i in limit: 
    url = 'http://pmindia.gov.in/content_print.php?nodeid='+i+'&nodetype=2'
    response = urllib2.urlopen(url)
    webContent = response.read()
    f = open('speech'+i+'.html', 'w')
    f.write(webContent)
    f.close

Fairly elementary, but I get one or both of these errors: 'int' object is not iterable or cannot concatenate 'str' and 'int' objects. These are the printable versions of the links on this page: http://pmindia.gov.in/all-speeches.php (1400 links). But the node ids go from 1 to 1441, which means 41 numbers are missing (which is a separate problem). Final question: in the long run, while downloading thousands of link objects, is there a way to run them in parallel to increase processing speed?


Solution

  • There are a couple of mistakes in your code.

    1. You got the syntax of for wrong. A for loop needs an object it can iterate over, such as a list or a generator; an integer like limit is not iterable.
    2. Adding a number to a string won't work. You need to convert the number to a string first, for example with repr (or str).

    With those fixes your code looks like:

    import urllib2

    limit = 1441

    for i in xrange(1, limit + 1):
        url = 'http://pmindia.gov.in/content_print.php?nodeid=' + repr(i) + '&nodetype=2'
        response = urllib2.urlopen(url)
        webContent = response.read()
        f = open('speech' + repr(i) + '.html', 'w')
        f.write(webContent)
        f.close()  # note the parentheses: f.close without them does nothing
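
    One more thing to watch for: since 41 of the node ids between 1 and 1441 have no page, the server will likely answer those requests with an HTTP error, and a bare urlopen call would then crash the loop. A hedged sketch of how to skip the missing ids (the save_speech helper and its opener parameter are illustrative names, not part of your original code; the HTTPError import location differs between Python 2 and 3):

```python
try:                                        # Python 2
    from urllib2 import urlopen, HTTPError
except ImportError:                         # Python 3
    from urllib.request import urlopen
    from urllib.error import HTTPError

def save_speech(i, opener=urlopen):
    """Fetch one node's printable page; return False if the id has no page."""
    url = 'http://pmindia.gov.in/content_print.php?nodeid=%d&nodetype=2' % i
    try:
        response = opener(url)
    except HTTPError:
        return False  # e.g. a 404 for one of the 41 missing node ids
    webContent = response.read()
    with open('speech%d.html' % i, 'wb') as f:  # 'with' closes the file even on error
        f.write(webContent)
    return True
```

    Counting how many calls return False also tells you exactly which node ids are missing.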
    

    Now, if you want to go into web scraping for real, I suggest you have a look at packages such as lxml and requests.
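
    As for your parallelism question: downloading is I/O-bound, so a small pool of worker threads is usually enough to speed it up considerably. A sketch using the thread-based Pool from multiprocessing.dummy, which exists in both Python 2 and 3 (the fetch/fetch_all helpers, the opener parameter, and the pool size of 8 are illustrative choices, not a fixed recipe):

```python
from functools import partial
from multiprocessing.dummy import Pool  # thread-based Pool: suited to I/O-bound work

try:                                    # Python 2
    from urllib2 import urlopen
except ImportError:                     # Python 3
    from urllib.request import urlopen

def fetch(i, opener=urlopen):
    """Download one printable page; return (node id, page content)."""
    url = 'http://pmindia.gov.in/content_print.php?nodeid=%d&nodetype=2' % i
    response = opener(url)
    try:
        return i, response.read()
    finally:
        response.close()

def fetch_all(node_ids, workers=8, opener=urlopen):
    """Fetch many pages concurrently; results come back in input order."""
    pool = Pool(workers)
    try:
        return pool.map(partial(fetch, opener=opener), node_ids)
    finally:
        pool.close()
        pool.join()
```

    Keep the worker count modest (single digits) so you don't hammer the server; pool.map also preserves the order of node_ids, so results line up with the ids you passed in.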