Tags: python, proxy, twisted, scrapy

Scrapy Timeouts and Twisted.Internet.Error


I'm running Scrapy with proxies, but there are times when the crawl hits the errors below at the end of a run, delaying the crawl finish time by 10+ seconds. How can I make Scrapy ignore these errors completely and immediately when they are detected, so that they don't waste time stalling the entire crawler?

RETRY_ENABLED = False is already set in settings.py.

The requests use a list of URLs. Many of the proxies are set to https:// rather than http://; I mention this just in case, although https works for almost all cases, so I doubt it is strictly an https-vs-http issue.
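
For context, the [scrapy.proxies] lines in the logs below suggest a proxy-rotation middleware is assigning the proxy to each request via request.meta['proxy']. Here is a minimal, hypothetical sketch of the equivalent done by hand; the spider name, URLs and proxy addresses are placeholders, not the real ones:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # Hypothetical spider illustrating the setup described above.
        name = "example"

        # Placeholder URLs and proxies; the real lists are not shown here.
        start_urls = ["https://example.com/page1", "https://example.com/page2"]
        proxies = [
            "https://proxy1.example.com:10492",  # most proxies use the https:// scheme
            "http://proxy2.example.com:8080",
        ]

        def start_requests(self):
            for i, url in enumerate(self.start_urls):
                yield scrapy.Request(
                    url,
                    callback=self.parse,
                    # Scrapy's HttpProxyMiddleware reads the proxy from request.meta;
                    # the https:// vs http:// scheme mentioned above is set here.
                    meta={"proxy": self.proxies[i % len(self.proxies)]},
                )

        def parse(self, response):
            yield {"url": response.url, "status": response.status}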

But I still get:

Error 1:

  • 2019-01-20 20:24:02 [scrapy.core.scraper] DEBUG: Scraped from <200>
  • ------------8 seconds spent------------------
  • 2019-01-20 20:24:10 [scrapy.proxies] INFO: Removing failed proxy
  • 2019-01-20 20:24:10 [scrapy.core.scraper] ERROR: Error downloading
  • Traceback (most recent call last):
  • File "/usr/local/lib64/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request defer.returnValue((yield download_func(request=request,spider=spider)))
  • scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy ukimportantd2.fogldn.com:10492 [{'status': 504, 'reason': b'Gateway Time-out'}]

Error 2:

  • 2019-01-20 20:15:46 [scrapy.proxies] INFO: Removing failed proxy
  • 2019-01-20 20:15:46 [scrapy.core.scraper] ERROR: Error downloading
  • ------------12 seconds spent------------------
  • 2019-01-20 20:15:58 [scrapy.core.engine] INFO: Closing spider (finished)
  • Traceback (most recent call last):
  • File "/usr/local/lib64/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request defer.returnValue((yield download_func(request=request,spider=spider)))
  • twisted.web._newclient.ResponseNeverReceived: [twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.]

Error 3:

  • Traceback (most recent call last):
  • File "/usr/local/lib64/python3.7/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request defer.returnValue((yield download_func(request=request,spider=spider)))
  • twisted.web._newclient.ResponseNeverReceived: [twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.]

Solution

  • How can I make Scrapy ignore these errors completely and immediately when they are detected

    That is already the case. The proxies are either raising the error after a few seconds rather than instantly, or timing out outright.

    If you are not willing to wait, you could consider decreasing the DOWNLOAD_TIMEOUT setting (see the first sketch below), but responses that used to take long yet succeed may then start timing out.

    A better approach may be to switch to higher-quality proxies, or to a smart proxy service such as Crawlera (second sketch below).
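
    As an illustration, the timeout can be lowered in settings.py; the value below is only an example, chosen under the assumption that your proxies normally answer within a few seconds, not a recommendation:

        # settings.py -- global downloader timeout (Scrapy's default is 180 seconds).
        # Lowering it makes dead or slow proxies fail sooner, at the risk of
        # cutting off slow-but-working responses.
        DOWNLOAD_TIMEOUT = 10

    Scrapy also honours a download_timeout key in Request.meta, so the tighter timeout can be applied per request instead of globally.

    If you go the smart-proxy route, Crawlera is typically wired in through its scrapy-crawlera plugin; a sketch of the relevant settings, with a placeholder API key:

        # settings.py -- sketch of enabling Crawlera via the scrapy-crawlera plugin.
        DOWNLOADER_MIDDLEWARES = {
            'scrapy_crawlera.CrawleraMiddleware': 610,
        }
        CRAWLERA_ENABLED = True
        CRAWLERA_APIKEY = '<your API key>'  # placeholder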