Tags: python, request, web-crawler

Python request to crawl a URL returns a 404 error, although the URL works in the browser


I have a Python crawling script that fails on one URL: pulsepoint.com/sellers.json

The bot uses a standard `requests.get` call to fetch the content, but gets a 404 error back. In the browser the same URL works (there is a 301 redirect, but requests can follow that). My first hunch was that this could be a request header issue, so I copied the headers my browser sends. The code looks like this:

        import logging
        import requests

        crawled_url = "pulsepoint.com"
        seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
        print(seller_json_url)
        # Headers copied from a Firefox session
        myheaders = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Pragma': 'no-cache',
            'Cache-Control': 'no-cache'
        }
        # requests follows the 301 redirect by default
        r = requests.get(seller_json_url, headers=myheaders)
        logging.info("  %d", r.status_code)

But I am still getting a 404 Error.
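
Since a redirect is involved, it can help to see which hop actually returns the 404. The following is a minimal sketch, assuming a `requests` response object like `r` from the snippet above; `redirect_chain` is a hypothetical helper name, not part of `requests`:

```python
def redirect_chain(response):
    """Return (status_code, url) for every hop, ending with the final response.

    On a requests.Response, `history` holds the intermediate redirect
    responses in the order they were followed.
    """
    hops = list(response.history) + [response]
    return [(hop.status_code, hop.url) for hop in hops]
```

Calling `redirect_chain(r)` after the `requests.get(...)` call and printing each pair shows the URL the 301 points to and the status of the final request.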

My next guesses:

  • Login? Not used here
  • Cookies? Not that I can see

So how is their server blocking my bot? By the way, this is a URL that is meant to be crawled, so nothing illegal here.


Solution

  • You can also work around the SSL certificate error, as shown below:

    import json
    import ssl
    from urllib.request import urlopen

    # Workaround for the SSL error: this disables certificate verification
    # for every HTTPS request made through urllib in this process.
    ssl._create_default_https_context = ssl._create_unverified_context

    crawled_url = "pulsepoint.com"
    seller_json_url = 'http://{thehost}/sellers.json'.format(thehost=crawled_url)
    print(seller_json_url)

    response = urlopen(seller_json_url).read()
    # Parse the JSON body and print it as a dictionary
    print(json.loads(response))
    

    Sample response:

    {'contact_email': '[email protected]', 'contact_address': '360 Madison Ave, 14th Floor, NY, NY, 10017', 'version': '1.0', 'identifiers': [{'name': 'TAG-ID', 'value': '89ff185a4c4e857c'}], 'sellers': [{'seller_id': '508738', ...

    ...'seller_type': 'PUBLISHER'}, {'seller_id': '562225', 'name': 'EL DIARIO', 'domain': 'impremedia.com', 'seller_type': 'PUBLISHER'}]}
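
The workaround above replaces a private hook in the `ssl` module and so affects the whole process. A narrower variant, sketched here with only public standard-library calls, passes an unverified context to a single `urlopen` call. The helper names `make_unverified_context` and `fetch_json` and the 10-second timeout are assumptions for illustration, and this has not been checked against the site in question:

```python
import json
import ssl
from urllib.request import urlopen


def make_unverified_context():
    """Build an SSL context that skips certificate and hostname checks."""
    context = ssl.create_default_context()
    # check_hostname must be turned off before verify_mode can be CERT_NONE.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return context


def fetch_json(url, context, timeout=10):
    """Fetch `url` and decode the body as JSON, using `context` for HTTPS."""
    with urlopen(url, context=context, timeout=timeout) as response:
        return json.loads(response.read())
```

With these helpers, `fetch_json('http://pulsepoint.com/sellers.json', make_unverified_context())` would issue the same request as the answer, with verification disabled only for that call.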