Tags: python, web-scraping, user-agent, scrapy

Share USER_AGENT between scrapy_fake_useragent and cfscrape scrapy extension


I'm trying to create a scraper for a Cloudflare-protected website using cfscrape, Privoxy and Tor, and scrapy_fake_useragent.

I'm using the cfscrape Python extension to bypass Cloudflare protection with Scrapy, and scrapy_fake_useragent to inject a random real USER_AGENT into the request headers.

As indicated by the cfscrape documentation: "You must use the same user-agent string for obtaining tokens and for making requests with those tokens, otherwise Cloudflare will flag you as a bot."

To collect the cookies needed by `cfscrape`, I need to redefine the `start_requests` method in my spider class, like this:

    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            token, agent = cfscrape.get_tokens(url)
            self.logger.info("agent = %s", agent)
            cf_requests.append(scrapy.Request(url=url,
                                              cookies=token,
                                              headers={'User-Agent': agent}))
        return cf_requests

My problem is that the user agent collected by `start_requests` is not the same as the one randomly selected by scrapy_fake_useragent, as you can see:

2017-01-11 12:15:08 [airports] INFO: agent = Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0
2017-01-11 12:15:08 [scrapy.core.engine] INFO: Spider opened
2017-01-11 12:15:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-11 12:15:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-11 12:15:08 [scrapy_fake_useragent.middleware] DEBUG: Assign User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10 to Proxy http://127.0.0.1:8118

I defined my extensions in `settings.py` in this order:

RANDOM_UA_PER_PROXY = True
HTTPS_PROXY = 'http://127.0.0.1:8118'
COOKIES_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'flight_project.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
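The custom `flight_project.middlewares.ProxyMiddleware` referenced above is not shown in the question. As a point of reference, a minimal sketch of what such a middleware typically looks like (this is an assumption about its contents, not the asker's actual code) would route every request through the local Privoxy instance by setting `request.meta['proxy']`, which the built-in `HttpProxyMiddleware` (priority 110, running after it) then honors:

```python
# Hypothetical sketch of flight_project.middlewares.ProxyMiddleware.
# It tags each request with the local Privoxy endpoint (which forwards to Tor);
# scrapy's HttpProxyMiddleware then applies request.meta['proxy'].
class ProxyMiddleware:
    PROXY = 'http://127.0.0.1:8118'  # Privoxy default port, matching settings.py

    def process_request(self, request, spider):
        request.meta['proxy'] = self.PROXY
        return None  # returning None lets the request continue through the chain
```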

I need the same user agent, so how can I pass/get the user agent randomly chosen by scrapy_fake_useragent into the `start_requests` method for the cfscrape extension?


Solution

  • Finally found the answer with the help of the scrapy_fake_useragent developer. Deactivate the line `'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400` in `settings.py`, then write this source code:

    import cfscrape
    import scrapy
    from fake_useragent import UserAgent


    class AirportsSpider(scrapy.Spider):
        name = "airports"
        start_urls = ['https://www.flightradar24.com/data/airports']
        allowed_domains = ['flightradar24.com']

        ua = UserAgent()
        ...

        def start_requests(self):
            cf_requests = []
            # Pick ONE random user agent and reuse it for every request
            user_agent = self.ua.random
            self.logger.info("RANDOM user_agent = %s", user_agent)
            for url in self.start_urls:
                # Pass the same user agent to cfscrape so the Cloudflare
                # token is obtained under the string we will send later
                token, agent = cfscrape.get_tokens(url, user_agent)
                self.logger.info("token = %s", token)
                self.logger.info("agent = %s", agent)

                cf_requests.append(scrapy.Request(url=url,
                                                  cookies=token,
                                                  headers={'User-Agent': agent}))
            return cf_requests
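The key point of the fix is that the random user agent is drawn once, before the loop, and the same string is both handed to `cfscrape.get_tokens()` and set on every `scrapy.Request` header. A stripped-down sketch of that invariant, with a stub standing in for `cfscrape.get_tokens` (the real call solves the Cloudflare challenge over the network and returns a `(cookie_dict, user_agent)` pair):

```python
import random

# Stub standing in for cfscrape.get_tokens(url, user_agent): the real
# function echoes back the user agent it was given alongside the cookies.
def get_tokens_stub(url, user_agent):
    return {'cf_clearance': 'dummy-token'}, user_agent

# Draw ONE random user agent up front (the solution uses
# fake_useragent's UserAgent().random for this step).
candidates = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3',
]
ua = random.choice(candidates)

requests = []
for url in ['https://example.com/a', 'https://example.com/b']:
    token, agent = get_tokens_stub(url, ua)
    # Every request carries the same User-Agent string as the token
    # it holds, which is exactly what Cloudflare checks for.
    requests.append({'url': url, 'cookies': token,
                     'headers': {'User-Agent': agent}})
```

Per-request random agents (what `RANDOM_UA_PER_PROXY` was doing) break this invariant, which is why the middleware line has to be disabled.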