403 error with mechanicalsoup


Why do I get a 403 when I try to scrape hacked.com, and how can I get around it? According to doesitusecloudflare.com, there is no Cloudflare in the way (http://www.doesitusecloudflare.com/?url=https%3A%2F%2Fhacked.com%2Fwp-login.php), and the robots.txt allows any user agent and only disallows access to the wp-admin login.

>>> import mechanicalsoup
>>> browser = mechanicalsoup.StatefulBrowser()
>>> browser.get('https://google.com')
<Response [200]>
>>> browser.get('https://hacked.com')
<Response [403]>
>>> browser.get('https://hacked.com').content
b'<html>\r\n<head><title>403 Forbidden</title></head>\r\n<body bgcolor="white">\r\n<center><h1>403 Forbidden</h1></center>\r\n<hr><center>nginx</center>\r\n</body>\r\n</html>\r\n'

Solution

  • Looking at mechanicalsoup/browser.py#L106, .get() is just a wrapper around requests.get(), so let's drop down to that instead.

    As we can see, the problem persists with plain requests:

    >>> import requests
    >>> response = requests.get('https://hacked.com')
    >>> response
    <Response [403]>

    I had an inkling, so I removed the User-Agent string:

    >>> request = response.request
    >>> request.headers
    {'User-Agent': 'python-requests/2.18.4', ...}
    >>> del request.headers['User-Agent']
    >>> request.headers
    {...}

    And tried again:

    >>> session = requests.Session()
    >>> session.send(request)
    <Response [200]>

    Tada! It looks like someone over at hacked.com is trying to block a certain bot by rejecting the default python-requests User-Agent, even though their robots.txt says you're allowed.

    Back in your context, it seems we just need to set a User-Agent string other than the one requests sends by default. I can't see a way to unset it with MechanicalSoup, so here's the best method I found:

    >>> import mechanicalsoup
    >>> b = mechanicalsoup.StatefulBrowser()
    >>> b.set_user_agent('my-awesome-script')
    >>> b.get('https://hacked.com')
    <Response [200]>
    >>>
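
If you'd rather stay with plain requests, the same idea can be sketched with a Session. This is a minimal sketch, not the only way to do it: make_session is a hypothetical helper name introduced here, and 'my-awesome-script' is just the placeholder string from above. Nothing is sent over the network; the request is only prepared so the outgoing headers can be inspected.

```python
import requests


def make_session(user_agent):
    """Return a requests.Session that sends `user_agent` instead of the default.

    Hypothetical helper for illustration; the name is not part of requests.
    """
    session = requests.Session()
    # Headers set on the session are merged into every request it prepares.
    session.headers.update({"User-Agent": user_agent})
    return session


# The default string that requests would otherwise send (python-requests/<version>).
print(requests.utils.default_user_agent())

session = make_session("my-awesome-script")

# Prepare the request without sending it, to see which headers would go out.
prepared = session.prepare_request(requests.Request("GET", "https://hacked.com"))
print(prepared.headers["User-Agent"])
```

Calling session.get('https://hacked.com') on that session would then go out with the custom User-Agent. Whether the server answers 200 still depends on its filtering rules at the time.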