
Can scrapy-splash Ignore 504 HTTP Status?


I want to scrape web pages that load their content with JavaScript, so I use scrapy-splash, but some pages take a very long time to load.

Like this: [screenshot]

I think the [processUser..] requests are what make it slower.

Is there any way to ignore those 504 pages? When I set the timeout to less than 90, I get a 504 Gateway Timeout error in the Scrapy shell and in spiders.

Also, can I still get the resulting HTML (with a 200 status only) once the time I set has elapsed?


Solution

  • Splash has a mechanism for aborting a request before its body starts loading: the splash:on_response_headers hook. In your case, however, that hook can only catch and abort the request once the status and headers have arrived, which is after the wait for the gateway timeout (504) is already over. Instead, you may want to use the splash:on_request hook to abort the request before it is even sent, like so:

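    function main(splash, args)
        -- Called for every outgoing request, before it is sent.
        splash:on_request(function(request)
            -- Drop the slow requests whose URL contains 'processUser'.
            if request.url:find('processUser') then
                request:abort()
            end
        end)
        assert(splash:go(args.url))
        -- Give the remaining scripts half a second to run.
        assert(splash:wait(.5))
        return {
            har = splash:har(),
        }
    end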
    

    Update: another, and perhaps better, way to handle this is to set splash.resource_timeout before any requests are made:

    function main(splash, args)
        splash.resource_timeout = 3  -- per-request timeout, in seconds
        ...
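
On the Scrapy side, a script like the ones above has to be sent to Splash's execute endpoint. The following is only a minimal sketch of how that might be wired up, not the answer author's code: it combines both ideas (a per-resource timeout plus aborting matching requests) in one Lua script and builds the args dict for scrapy-splash. The helper name build_splash_args and the argument names block_substring, res_timeout and wait are hypothetical, chosen for illustration, and the script returns splash:html() rather than the HAR because the question asks for the HTML.

```python
# Lua script combining both suggestions: a per-resource timeout and
# aborting requests whose URL contains a given substring.
LUA_SCRIPT = """
function main(splash, args)
    -- Give up on any single resource after this many seconds.
    splash.resource_timeout = args.res_timeout
    splash:on_request(function(request)
        -- Plain-text search (plain = true), so no Lua pattern matching.
        if request.url:find(args.block_substring, 1, true) then
            request:abort()
        end
    end)
    assert(splash:go(args.url))
    assert(splash:wait(args.wait))
    return {html = splash:html()}
end
"""


def build_splash_args(block_substring="processUser", res_timeout=3.0,
                      wait=0.5, timeout=90):
    """Build the args dict for a scrapy-splash request to the execute endpoint.

    All names except lua_source and timeout are hypothetical script
    arguments that the Lua code above reads from `args`.
    """
    if res_timeout >= timeout:
        raise ValueError("res_timeout should be smaller than the overall timeout")
    return {
        "lua_source": LUA_SCRIPT,            # script run by the execute endpoint
        "block_substring": block_substring,  # URLs containing this are aborted
        "res_timeout": res_timeout,          # seconds per individual resource
        "wait": wait,                        # seconds to wait after page load
        "timeout": timeout,                  # overall Splash render timeout, seconds
    }
```

In a spider, the result would be passed along the lines of SplashRequest(url, self.parse, endpoint="execute", args=build_splash_args()), assuming scrapy-splash is installed and configured. scrapy-splash adds the request URL to the arguments itself, which is why the script can read args.url.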