I am studying the Scrapy examples at https://www.accordbox.com/blog/how-crawl-infinite-scrolling-pages-using-python/
Regarding the yield Request calls in the Scrapy solution code there, I am very confused.
There are three yield Request statements. Sometimes a Request is just generated, sometimes it is generated and executed, and sometimes it is just executed.
Could you advise me on the differences between them, please?
Thank you!
import logging
from urllib import parse

from scrapy import Request


def parse_list_page(self, response):
    next_link = response.xpath(
        "//a[@class='page-link next-page']/@href").extract_first()
    if next_link:
        url = response.url
        next_link = url[:url.find('?')] + next_link
        ################################
        # Generate and Execute Request
        ################################
        yield Request(
            url=next_link,
            callback=self.parse_list_page
        )
    for req in self.extract_product(response):
        ################################
        # Just Execute Request
        ################################
        yield req

def extract_product(self, response):
    links = response.xpath(
        "//div[@class='col-lg-8']//div[@class='card']/a/@href").extract()
    for url in links:
        result = parse.urlparse(response.url)
        base_url = parse.urlunparse(
            (result.scheme, result.netloc, "", "", "", "")
        )
        url = parse.urljoin(base_url, url)
        ################################
        # Just Generate Request
        ################################
        yield Request(
            url=url,
            callback=self.parse_product_page
        )

def parse_product_page(self, response):
    logging.info("processing " + response.url)
    yield None
You may find the figure here useful for answering your question.
The yields from the parse_list_page method are yielding requests back to the "Engine" (step 7 in the figure). The yield in extract_product is yielding back to parse_list_page, which then immediately yields them back to the engine.
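To make that re-yielding concrete, here is a minimal pure-Python sketch with no Scrapy involved. The names parse_page and extract_items are made up to mirror parse_list_page and extract_product: the inner generator yields its values to the outer one, and the outer one passes each value straight through to whatever is consuming it (the "engine").

```python
def extract_items():
    # Plays the role of extract_product: yields values to its caller,
    # not directly to the final consumer.
    for i in range(3):
        yield f"request-{i}"

def parse_page():
    # Plays the role of parse_list_page.
    yield "next-page-request"      # yielded directly to the consumer
    for req in extract_items():    # each inner yield is received here...
        yield req                  # ...and immediately re-yielded outward

# The "engine" is simply whatever iterates over the generator.
results = list(parse_page())
```

Here results is ["next-page-request", "request-0", "request-1", "request-2"]: the consumer sees one flat stream and cannot tell which yields came from the inner generator.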
Note that all the code in extract_product could also just go into parse_list_page to make one single method. Having two methods just nicely separates the logic.
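It may also help to see why "just generated" and "executed" are different moments in time. A Request object is only a description of work; nothing is fetched when it is created. Likewise, calling a generator function runs none of its body until something iterates over it. A small sketch (the names gen and executed are made up for illustration):

```python
executed = []

def gen():
    # Nothing in this body runs when gen() is called;
    # it runs only when the consumer pulls a value.
    executed.append("started")
    yield "req"

g = gen()        # "just generated": a generator object, no code has run
before = list(executed)   # still empty at this point

first = next(g)  # "executed": the body runs because a value was pulled
after = list(executed)    # now contains "started"
```

This is the same laziness Scrapy relies on: your callback yields Request objects, and the engine decides when each one is actually scheduled and downloaded.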