Tags: python, web-crawler, mechanize, scraper

Python + Mechanize not working with Delicious


I'm using Mechanize and BeautifulSoup to scrape some data off Delicious:

from mechanize import Browser
from BeautifulSoup import BeautifulSoup

mech = Browser()
url = "http://www.delicious.com/varunsrin"
page = mech.open(url)
html = page.read()

soup = BeautifulSoup(html)
print soup.prettify()

This works for most sites I throw it at, but it fails on Delicious with the following output:

Traceback (most recent call last):
  File "C:\Users\Varun\Desktop\Python-3.py", line 7, in <module>
    page = mech.open(url)
  File "C:\Python26\lib\site-packages\mechanize\_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "C:\Python26\lib\site-packages\mechanize\_mechanize.py", line 255, in _mech_open
    raise response
httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt

C:\Program Files (x86)\ActiveState Komodo IDE 6\lib\support\dbgp\pythonlib\dbgp\client.py:1360: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  child = getattr(self.value, childStr)
C:\Program Files (x86)\ActiveState Komodo IDE 6\lib\support\dbgp\pythonlib\dbgp\client.py:456: DeprecationWarning: BaseException.message has been deprecated as of Python 2.6
  return apply(func, args)
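
Note that the 403 is raised by mechanize itself: by default it fetches the site's robots.txt and refuses any URL that file disallows for its default Python-urllib style User-Agent. You can check what the robots rules say with the stdlib robotparser module; a minimal sketch, assuming delicious.com still serves a robots.txt that blocks generic clients:

    # Sketch: check what robots.txt allows for a Python-style User-Agent.
    # The exact delicious.com rules are an assumption here.
    import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.delicious.com/robots.txt")
    rp.read()

    # mechanize identifies itself with a Python-urllib style agent by default
    print rp.can_fetch("Python-urllib/2.6", "http://www.delicious.com/varunsrin")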

Solution

  • Take some of the tips for emulating a browser with Python + mechanize from here. Adding addheaders and set_handle_robots appears to be the minimum required; with the code below, the page loads and prints as expected:

    from mechanize import Browser
    from BeautifulSoup import BeautifulSoup

    br = Browser()
    # Stop mechanize from fetching and obeying robots.txt
    br.set_handle_robots(False)
    # Present a real browser User-Agent instead of the Python default
    br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]

    url = "http://www.delicious.com/varunsrin"
    page = br.open(url)
    html = page.read()

    soup = BeautifulSoup(html)
    print soup.prettify()
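
The two changes do distinct jobs: set_handle_robots(False) skips the robots.txt check entirely, while the addheaders entry replaces the default Python User-Agent with a Firefox one so the server treats the request like a normal browser. If the server still refuses, it helps to catch the error instead of crashing; a minimal defensive sketch, catching mechanize.HTTPError (mechanize re-exports the urllib2 exception):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)                       # skip the robots.txt check
    br.addheaders = [('User-agent', 'Mozilla/5.0')]   # minimal browser-like UA

    try:
        page = br.open("http://www.delicious.com/varunsrin")
    except mechanize.HTTPError, e:
        print "Request failed with HTTP %d" % e.code
    else:
        print page.read()[:200]                       # short sample as a sanity check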