In the code below, len(self.crawler.engine.slot.scheduler) always returns 0, while self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued'] returns values in increasing order: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. I was expecting the opposite: a large queue before crawling that shrinks as URLs get crawled, so a higher value before and a lower value after. Also, uncommenting this code shows the same increasing trend in the queue-size numbers:
    if next_page is not None:
        next_page = response.urljoin(next_page)
        yield scrapy.Request(next_page, callback=self.parse)
Note: I have set CONCURRENT_REQUESTS = 1 in the settings.
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes_spider"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
        "https://quotes.toscrape.com/page/3/",
        "https://quotes.toscrape.com/page/4/",
        "https://quotes.toscrape.com/page/5/",
        "https://quotes.toscrape.com/page/6/",
        "https://quotes.toscrape.com/page/7/",
        "https://quotes.toscrape.com/page/8/",
        "https://quotes.toscrape.com/page/9/",
        "https://quotes.toscrape.com/page/10/",
    ]

    def parse(self, response):
        print(f"\n before {self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued']} \n\n")
        print(f"\n before2 {len(self.crawler.engine.slot.scheduler)}")  # don't know why it always returns zero
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
        print(f"\n After {self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued']} \n\n")
        print(f"\n after2 {len(self.crawler.engine.slot.scheduler)}")  # don't know why it always returns zero
This is the original question (I could not comment there because of low reputation): How to get the number of requests in queue in scrapy?
The spider code is copied from the Scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html
How to get the number of requests in queue in python scrapy?
len(self.crawler.engine.slot.scheduler)
(assuming you mean the scheduler queue, which is the context of the original question and the things you tried)
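For completeness, here is a minimal sketch of reading that length from inside a callback. Note that engine.slot.scheduler is internal, undocumented API and may change between Scrapy versions; has_pending_requests() is part of the documented scheduler interface. The spider name and URL are placeholders:

    import scrapy


    class QueueSizeSpider(scrapy.Spider):
        # Placeholder name/URL for illustration.
        name = "queue_size_demo"
        start_urls = ["https://quotes.toscrape.com/page/1/"]

        def parse(self, response):
            # engine.slot.scheduler is internal API; len() works because the
            # default scheduler implements __len__ as the number of pending
            # requests across its memory and disk queues.
            scheduler = self.crawler.engine.slot.scheduler
            self.logger.info("pending requests in queue: %d", len(scheduler))
            self.logger.info("queue non-empty: %s", scheduler.has_pending_requests())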
You observed that self.crawler.engine.slot.scheduler.stats._stats['scheduler/enqueued'] returns values in increasing order: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. That is the expected behavior for a stat that counts the total number of requests ever scheduled: stats are cumulative counters and, in general, do not reflect the current state of anything.
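If you want a current-size figure derived from stats instead, you can subtract the two cumulative counters. A sketch, assuming the default scheduler (which records both scheduler/enqueued and scheduler/dequeued) and using the public stats API rather than the private _stats dict:

    def queued_now(crawler):
        """Approximate number of requests currently in the scheduler queue,
        computed as (total enqueued) - (total dequeued)."""
        stats = crawler.stats
        enqueued = stats.get_value("scheduler/enqueued", 0)
        dequeued = stats.get_value("scheduler/dequeued", 0)
        return enqueued - dequeued

Inside a spider callback this would be called as queued_now(self.crawler), and it should track len(self.crawler.engine.slot.scheduler).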
You also observed that len(self.crawler.engine.slot.scheduler) always returns 0. This means the scheduler queue is empty at the points where you check it, which makes sense for a spider that downloads pages faster than new requests are scheduled. Note also that Scrapy feeds start requests to the scheduler lazily, one at a time as the engine needs work, rather than enqueueing all ten start URLs up front, so the queue never gets a chance to build up.
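To actually see a non-zero queue, schedule requests faster than they can be downloaded, for example by yielding many requests from a single callback. A sketch (the spider name and the check_queue callback are made up for illustration); with CONCURRENT_REQUESTS = 1 the queue should fill up at once and then drain one request per completed download:

    import scrapy


    class BurstSpider(scrapy.Spider):
        # Placeholder spider for illustration.
        name = "burst_demo"
        start_urls = ["https://quotes.toscrape.com/page/1/"]

        def parse(self, response):
            # Enqueue ten pages at once; with CONCURRENT_REQUESTS = 1 only one
            # can download at a time, so the rest sit in the scheduler queue
            # and len(...) should be non-zero, decreasing as pages finish.
            for page in range(2, 12):
                yield scrapy.Request(
                    f"https://quotes.toscrape.com/page/{page}/",
                    callback=self.check_queue,
                )

        def check_queue(self, response):
            self.logger.info(
                "queue size after %s: %d",
                response.url,
                len(self.crawler.engine.slot.scheduler),
            )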