Scrapy: create yielded fields dynamically with a variable


I want to get all bullet points from an Amazon product page (e.g. Amazon link) with Scrapy; however, their number varies. I ended up using something like this:

def parse(self, response):
    t = response
    url = t.request.url
    yield {
        'bullets_no': len(t.xpath('//div[@id="feature-bullets"]//li/span/text()')),
        'bullet_1': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[0].get().strip(),
        'bullet_2': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[1].get().strip(),
        'bullet_3': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[2].get().strip(),
        'bullet_4': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[3].get().strip(),
        'bullet_5': t.xpath('//div[@id="feature-bullets"]//li/span/text()')[4].get().strip(),
        # ...
    }

However, in plain Python I would simply do something like this, and it would adjust automatically:

bullets = t.xpath('//div[@id="feature-bullets"]//li/span/text()')
for i, bullet in enumerate(bullets):
    row[f'Bullet_{i+1}'] = bullet.strip()

Is it possible to create yielded fields like this in scrapy?


Solution

  • Yes, this is covered in detail in the Scrapy tutorial, which I highly suggest reading.

    Both response.css and response.xpath return a SelectorList object, which you can iterate over just like a regular Python list.

    As the tutorial puts it: "The result of running response.css('title') is a list-like object called SelectorList, which represents a list of Selector objects that wrap around XML/HTML elements and allow you to run further queries to fine-grain the selection or extract the data."

    So using your example you could do something like this:

    def parse(self, response):
        item = {'url': response.url}
        bullets = response.xpath('//div[@id="feature-bullets"]//li/span/text()')
        for i, bullet in enumerate(bullets, start=1):
            item[f'bullet_{i}'] = bullet.get().strip()
        # Use len() here: the loop variable i is undefined when no bullets match.
        item['bullets_no'] = len(bullets)
        yield item
    

    As mentioned in a previous answer, there is also the getall method, which you can call on a selector list:

    The other thing is that the result of calling .getall() is a list: it is possible that a selector returns more than one result, so we extract them all.

    I suggest giving the Extracting Data and Extracting Quotes and Authors sections of the scrapy docs tutorial a read to find out more.
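
The getall approach mentioned above could be sketched roughly as follows. This is only an illustration, not the one right way to do it: the helper name build_bullet_item and the sample strings are made up for this example, and dropping bullets that are empty after stripping is an assumption that may or may not suit your data. The helper takes plain strings (what .getall() returns), so it can be tried without running a spider.

```python
def build_bullet_item(url, bullet_texts):
    """Build one item dict with a numbered field per non-empty bullet.

    Hypothetical helper for illustration; bullet_texts is a list of plain
    strings, such as the list returned by calling .getall() on a SelectorList.
    """
    # Strip whitespace and drop entries that are empty after stripping
    # (an assumption: whitespace-only text nodes are treated as noise).
    cleaned = [text.strip() for text in bullet_texts if text.strip()]
    item = {"url": url, "bullets_no": len(cleaned)}
    # One field per bullet, numbered from 1, however many there are.
    for i, text in enumerate(cleaned, start=1):
        item[f"bullet_{i}"] = text
    return item


# Inside a spider, the strings would come from .getall():
#     def parse(self, response):
#         texts = response.xpath('//div[@id="feature-bullets"]//li/span/text()').getall()
#         yield build_bullet_item(response.url, texts)

# Made-up sample data standing in for the extracted text nodes.
sample = ["  Fast charging \n", "\n", "Two-year warranty"]
item = build_bullet_item("https://example.com/product", sample)
print(item)
```

Keep in mind that items with a varying set of keys can be awkward for some outputs such as CSV, where the columns are usually fixed up front. If that matters, storing the bullets as a single list field (for example item['bullets'] = cleaned) might be simpler.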