Tags: python, scrapy, throughput

How to find when a request started and when it ended in Scrapy


I am trying to measure the throughput of the system in Scrapy, and to do that I need to find out when an HTTP request was fired and when it was completed.

Any directions toward a solution are highly appreciated.


Solution

  • You could use custom middleware:

    from datetime import datetime
    import logging

    class MeasureMiddleware:
        def __init__(self):
            # (url, start_time) pairs for requests that are still in flight
            self.requests = []

        def process_request(self, request, spider):
            # store the time and url of every outgoing request
            self.requests.append((request.url, datetime.now()))

        def process_response(self, request, response, spider):
            # for every response, check whether one of the tracked requests came back;
            # if so, log its start time and the current time
            still_pending = []
            # go through tracked requests and check whether any of them match the current url
            for url, start_date in self.requests:
                if url == request.url:
                    logging.info(f'request {url} {start_date} - {datetime.now()}')
                else:
                    still_pending.append((url, start_date))
            self.requests = still_pending
            # downloader middleware must return the response to pass it on
            return response
    

    Then activate the downloader middleware in your project settings:

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.MeasureMiddleware': 543,
    }
    

    It's worth noting that, due to the asynchronous nature of Scrapy, this won't be millisecond-accurate, but it should be accurate enough to give a general overview.
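    If keeping a shared list of in-flight requests feels heavy (and it can mis-attribute timings when the same URL is requested more than once), a simpler variant is to stash the start time on each request via request.meta, which Scrapy carries through to the matching response. The class name MetaTimingMiddleware below is just an illustration, not part of Scrapy; the sketch only assumes the standard downloader-middleware hooks:

    ```python
    from datetime import datetime
    import logging

    class MetaTimingMiddleware:
        # a minimal sketch: carry the start time on the request itself,
        # so no shared state is needed and duplicate urls pair up correctly
        def process_request(self, request, spider):
            request.meta['start_time'] = datetime.now()

        def process_response(self, request, response, spider):
            start = request.meta.get('start_time')
            if start is not None:
                logging.info('request %s took %s', request.url, datetime.now() - start)
            # always return the response so it continues through the chain
            return response
    ```

    It would be activated through DOWNLOADER_MIDDLEWARES exactly like the middleware above. Scrapy also records a download_latency value in response.meta for downloaded responses, which may already be enough if only the download time (rather than your own timestamps) is needed.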