
Advanced Scrapy: using middleware


I want to develop several middlewares to make sure websites get parsed. This is the workflow I have in mind:

  • First, try with TOR + Polipo
  • After 2 HTTP errors, try without TOR (so the website sees my IP)
  • After 2 more HTTP errors, try with a proxy (one of my other servers makes the HTTP request)
  • After 2 more HTTP errors, try with a random proxy (from a list of 100); this is repeated 5 times
  • If none of these works, save the information to an Elasticsearch database so I can see it on my control panel
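The escalation ladder above could be sketched in plain Python; the names (`PROXY_LADDER`, `next_method`) are illustrative, not part of Scrapy:

```python
# Hypothetical sketch of the escalation ladder described above.
PROXY_LADDER = [
    ("tor_polipo", 2),     # TOR + Polipo, give up after 2 HTTP errors
    ("direct", 2),         # direct connection (website sees my IP)
    ("own_proxy", 2),      # a proxy on one of my own servers
    ("random_proxy", 10),  # random proxy from a list: 2 errors x 5 picks
]

def next_method(method, failures):
    """Return (method, failure count) to use after a failure,
    or None once every rung of the ladder is exhausted."""
    names = [name for name, _ in PROXY_LADDER]
    limits = dict(PROXY_LADDER)
    if failures < limits[method]:
        return method, failures      # stay on the current method
    i = names.index(method)
    if i + 1 < len(names):
        return names[i + 1], 0       # move to the next rung
    return None                      # give up -> log to Elasticsearch
```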

I'll create a custom middleware whose process_request function contains all five of these methods. But I can't figure out how to save the type of connection (for example, if TOR doesn't work but a direct connection does, I want to use that setting for all my other scrapes of the same website). How can I save this setting?

One other thing: I have a pipeline which downloads the images of items. Is there a way to use this middleware (ideally with the saved settings) for it as well?
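For context on the images question: requests produced by the images pipeline are scheduled through the downloader like any other request, so they pass through the same downloader middlewares. A settings sketch, assuming a hypothetical `ProxyFallbackMiddleware` in a `myproject` package (paths and priorities are illustrative):

```python
# settings.py -- illustrative priorities; myproject paths are placeholders
DOWNLOADER_MIDDLEWARES = {
    # Runs before the built-in HttpProxyMiddleware (priority 750),
    # which reads request.meta["proxy"].
    "myproject.middlewares.ProxyFallbackMiddleware": 560,
}
ITEM_PIPELINES = {
    "scrapy.pipelines.images.ImagesPipeline": 1,
}
IMAGES_STORE = "/path/to/images"
```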

Thanks in advance for your help.


Solution

  • I think you could use the retry middleware as a starting point:

    1. You could use request.meta["proxy_method"] to keep track of which connection method you are using

    2. You could reuse request.meta["retry_times"] in order to track how many times you have retried a given method, and then set the value to zero when you change the proxy method.

    3. You could use request.meta["proxy"] to use the proxy server you want via the existing HTTP proxy middleware. You may want to tweak the middleware ordering so that the retry middleware runs before the proxy middleware.
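Putting those three meta keys together, a retry-style middleware might look like the sketch below. It avoids importing Scrapy so the logic stands alone; the proxy URLs and class name are assumptions, and a real implementation would likely subclass scrapy.downloadermiddlewares.retry.RetryMiddleware:

```python
# Hypothetical fallback middleware using the meta keys from the answer.
PROXIES = {                        # illustrative proxy URLs per method
    "tor_polipo": "http://127.0.0.1:8123",
    "direct": None,                # None = no proxy (direct connection)
    "own_proxy": "http://my-server:3128",
}
ORDER = ["tor_polipo", "direct", "own_proxy"]
MAX_RETRIES = 2

class ProxyFallbackMiddleware:
    def process_response(self, request, response, spider):
        if response.status < 400:
            return response                    # success: pass through
        meta = request.meta
        method = meta.get("proxy_method", ORDER[0])
        retries = meta.get("retry_times", 0) + 1
        if retries >= MAX_RETRIES:             # this method is exhausted
            i = ORDER.index(method) + 1
            if i >= len(ORDER):
                return response                # give up: let the error through
            method, retries = ORDER[i], 0      # next rung, reset the counter
        retry = request.copy()
        retry.meta["proxy_method"] = method
        retry.meta["retry_times"] = retries
        retry.meta["proxy"] = PROXIES[method]  # read by HttpProxyMiddleware
        retry.dont_filter = True               # bypass the duplicates filter
        return retry
```

Returning a request from process_response tells Scrapy to reschedule it instead of passing the response on, which is the same mechanism the built-in retry middleware uses.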