I want to scrape JavaScript-heavy web pages, so I use scrapy-splash, but some pages take a very long time to load.
I think the [processUser..] requests are what makes it slow.
Is there any way to ignore those 504 pages? When I set the timeout to less than 90, I get a 504 Gateway Timeout error in the Scrapy shell and in my spiders.
And can I still get the resulting HTML (only the 200 responses) when the time I set runs out?
There's a mechanism in Splash to abort a request before it starts loading the body, which you can leverage with the splash:on_response_headers
hook. However, in your case this hook will only be able to catch and abort the page once the status and the headers are in, and that is after it has finished waiting for the gateway timeout (504). So instead you may want the splash:on_request
hook, which aborts the request before it is even sent, like so:
function main(splash, args)
    -- Abort any request whose URL contains 'processUser'
    -- before it is sent, so the page never waits on it.
    splash:on_request(function(request)
        if request.url:find('processUser') then
            request:abort()
        end
    end)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        har = splash:har(),
    }
end
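For completeness, here is a minimal sketch (Python standard library only) of how such a script could be submitted to Splash's /execute HTTP endpoint, assuming a Splash instance on the default localhost:8050; in a real Scrapy project you would normally let scrapy-splash's SplashRequest do this for you. The payload keys (lua_source, url, timeout) are the ones the Splash HTTP API expects.

```python
import json

# The same Splash Lua script as above: abort matching
# requests before they are sent.
LUA_SCRIPT = """
function main(splash, args)
    splash:on_request(function(request)
        if request.url:find('processUser') then
            request:abort()
        end
    end)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return { har = splash:har() }
end
"""

def build_execute_payload(url, timeout=90):
    # JSON body for a POST to http://localhost:8050/execute
    # (endpoint and parameter names per the Splash HTTP API).
    return json.dumps({
        "lua_source": LUA_SCRIPT,
        "url": url,
        "timeout": timeout,
    })

payload = build_execute_payload("http://example.com")
```

Sending `payload` with any HTTP client (urllib, requests, or a plain Scrapy Request) against /execute returns the table the script's main function returns, serialized as JSON.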
UPD: Another, and perhaps better, way to go about this is to set splash.resource_timeout
before any requests take place:
function main(splash, args)
    -- Drop any single resource that takes longer than 3 seconds.
    splash.resource_timeout = 3
    ...