Tags: python, scrapy, scrapy-splash

Why do Splash + Scrapy add an HTML wrapper to a JSON response?


What am I missing?

I'm trying to scrape some JSON, but I keep receiving this HTML wrapper around the JSON response:

response.data['html'] returns:

2021-02-18 10:35:57 [bcb] DEBUG: b'<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{"TotalRows":132,"RowCount":15,"Rows":[{"tit`....

Here is the code:

    yield scrapy.Request(address_pesquisa, self.parse, meta={
        'splash': {
            'args': {
                # set rendering arguments here
                'html': 1,
                'png': 0,
            },
            # optional parameters
            'endpoint': 'render.json',  # optional; default is render.json
            'splash_url': 'http://192.168.15.100:8050',  # optional; overrides SPLASH_URL
            'slot_policy': scrapy_splash.SlotPolicy.PER_DOMAIN,
            'splash_headers': {},  # optional; a dict with headers sent to Splash
            'dont_process_response': False,  # optional, default is False
            'dont_send_headers': True,  # optional, default is False
            'magic_response': True,  # optional, default is True
        }
    })

Do I have to strip this wrapper myself with a regex or something? Or is my Scrapy setup misconfigured?


Solution

  • A straightforward option for extracting the JSON from inside the HTML is to use XPath (or CSS selectors); see the Scrapy Selectors documentation.

    Something like this in the scrapy.Request callback function (self.parse):

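    import json  # at the top of the spider module
    # absolute path; the relative 'html/body/...' form may match nothing
    json_response = response.xpath('/html/body/pre/text()').get()
    json_response = json.loads(json_response)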
    

    Note that I didn't test the code, so you might need to adjust it a little (if I mistyped the XPath or something).
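
    If you want to try the extraction outside of Scrapy first, here is a minimal sketch using only the standard library. The names PreTextExtractor and json_from_wrapped_html are made up for this example, the sample body is a hypothetical stand-in shaped like the logged response (the real one is truncated), and UTF-8 encoding is an assumption.

```python
import json
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Collect the text found inside the first <pre> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_pre = False
        self._done = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre" and not self._done:
            self._in_pre = True

    def handle_endtag(self, tag):
        if tag == "pre" and self._in_pre:
            self._in_pre = False
            self._done = True

    def handle_data(self, data):
        if self._in_pre:
            self.chunks.append(data)


def json_from_wrapped_html(html):
    """Parse JSON that arrived wrapped in <html><body><pre>...</pre>."""
    if isinstance(html, bytes):
        html = html.decode("utf-8")  # assumption: the body is UTF-8
    parser = PreTextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    if not text.strip():
        raise ValueError("no <pre> text found in the response")
    return json.loads(text)


# Hypothetical sample shaped like the logged response; "Rows" is left empty
# because the original log line is truncated.
sample = (
    b'<html><head></head><body>'
    b'<pre style="word-wrap: break-word; white-space: pre-wrap;">'
    b'{"TotalRows":132,"RowCount":15,"Rows":[]}'
    b'</pre></body></html>'
)
data = json_from_wrapped_html(sample)
print(data["TotalRows"], data["RowCount"])
```

    An HTML parser is used here instead of a regex so that entities such as &amp; inside the JSON text get decoded before json.loads sees them.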

    Also, you might want to try downloading the page with, e.g., curl or the Scrapy shell and check whether the HTML part is still in the response. If it isn't, then going through Splash is somehow what makes the response come back wrapped in HTML.
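
    A rough way to do that check from Python instead of curl, as a sketch only: the helper names (is_html_wrapped, fetch_raw, describe) and the Accept header are assumptions for illustration, not something the site is known to require.

```python
import json
import urllib.request


def is_html_wrapped(body):
    """Return True if the body starts like an HTML page rather than JSON."""
    head = body.lstrip()[:64].lower()
    return head.startswith(b"<html") or head.startswith(b"<!doctype")


def fetch_raw(url, timeout=10.0):
    """Download the URL directly, with no browser engine in between."""
    # Assumption: asking for JSON explicitly; the server may ignore this header.
    request = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def describe(body):
    """Classify a response body as HTML-wrapped, plain JSON, or neither."""
    if is_html_wrapped(body):
        return "html-wrapped"
    try:
        json.loads(body)
    except ValueError:
        return "neither"
    return "plain-json"
```

    Calling describe(fetch_raw(address_pesquisa)) with the URL from the question would then tell you whether the direct response is plain JSON. If it is, the wrapper is being added somewhere on the Splash side.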


    Update on why the HTML is not in the response when using curl:

    One possibility is that the web server returns a different response when using a browser than when using curl. One reason for doing this is to make the JSON more readable for the user using the browser. I mean, trying to read through JSON is easier when it's properly formatted and not just everything on a single line :D

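    So, if this is the case, my guess would be that Splash passes some data to the server (e.g. the User-Agent, or being able to render JavaScript) that makes the server return a response with the HTML. Another possibility is that the server sends plain JSON and the wrapper comes from Splash itself: with 'html': 1 it returns the page as rendered by its browser engine, and browser engines typically display plain-text responses inside a <pre> element.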

    Skipping Splash and making the request with a plain Scrapy Request could help (and also make the crawler a little bit faster).

    Anyway, if the XPath works (and the small, merely possible speed gain doesn't matter to you), go with the XPath.