I am trying to scrape all the products on this site: https://www.jny.com/collections/jackets
The spider first collects links to all the products and then scrapes them one by one. I am trying to speed this process up with multi-threading. Here is the code:
def yield1(self, url):
    print("inside function")
    yield scrapy.Request(url, callback=self.parse_product)

def parse(self, response):
    print("in herre")
    self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
    print(self.product_url)
    for pu in self.product_url:
        print("inside the loop")
        with ThreadPoolExecutor(max_workers=10) as executor:
            print("inside thread")
            executor.map(self.yield1, response.urljoin(pu))
It is supposed to create a pool of 10 threads, each of which will execute yield1() on the list of URLs. The problem is that the yield1() method is never called.
yield1 is a generator function. To get it to yield a value you have to call next() on it. Change it so it returns a value instead:
def yield1(self, url):
    print("inside function")
    return scrapy.Request(url, callback=self.parse_product)
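To see why the generator version never printed anything: calling a generator function only creates a generator object, and none of its body runs until you iterate it. A minimal illustration in plain Python (no Scrapy involved):

def gen(url):
    print("inside function")   # does not run when gen() is called
    yield url

g = gen("https://example.com")  # creates a generator object, prints nothing
print(next(g))                  # only now does the body execute and yield the url

That is what happened inside executor.map: the threads called yield1 and got back generator objects that nothing ever iterated.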
Caveat: I don't really know anything about Scrapy.
The Overview in the docs says that requests are made asynchronously, and your code doesn't look like the examples given there. The example in the overview shows subsequent requests being made in the parse method using response.follow. Your code looks like you are trying to extract links from a page and then asynchronously scrape those links and parse them with a different method. Since it seems like Scrapy will do this for you and handle the asynchronicity (?), I think you just need to define another parse method in your spider and use response.follow to schedule more asynchronous requests. You shouldn't need concurrent.futures; the new requests should all be processed asynchronously.
I have no way of testing this but I think your spider should look more like this:
import scrapy

class TempSpider(scrapy.Spider):
    name = 'foo'
    start_urls = [
        'https://www.jny.com/collections/jackets',
    ]

    def parse(self, response):
        self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall()
        for pu in self.product_url:
            print("inside the loop")
            yield response.follow(response.urljoin(pu), self.parse_product)

    def parse_product(self, response):
        '''parses the product urls'''
This assumes self.product_url = response.xpath('//div[@class = "collection-grid js-filter-grid"]//a/@href').getall() does what it is supposed to.
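parse_product is left empty above. As a rough sketch of what it might contain (the selectors here are guesses; I haven't inspected the product pages, so adjust them to the actual markup):

    def parse_product(self, response):
        '''parses the product urls'''
        # hypothetical selectors -- replace with ones that match the real page
        yield {
            'url': response.url,
            'title': response.xpath('//h1/text()').get(),
            'price': response.xpath('//span[contains(@class, "price")]/text()').get(),
        }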
Maybe even use a separate Spider to parse the subsequent links, or use a CrawlSpider.
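For the CrawlSpider route, a minimal sketch (again untested; the restrict_xpaths value is an assumption based on your selector) would be:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class JacketsSpider(CrawlSpider):
    name = 'jackets'
    start_urls = ['https://www.jny.com/collections/jackets']

    # follow only links inside the collection grid and hand each one to parse_product
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class = "collection-grid js-filter-grid"]'),
             callback='parse_product'),
    )

    def parse_product(self, response):
        '''parses the product urls'''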
Relevant SO Q&As:
Scraping links with Scrapy
scraping web page containing anchor tag using scrapy
Use scrapy to get list of urls, and then scrape content inside those urls (looks familiar)
Scrapy, scrape pages from second set of links
many more