Tags: python, web-scraping, scrapy, scrapy-pipeline

List elements retrieved by XPath in Scrapy are not output correctly item by item (for, yield)


I write the URL of the first order-results page of each exhibitor, extracted from a specific EC site, to a CSV file, read that file in start_requests, and loop over its rows with a for statement.

Each order-results page contains information on 30 products.

https://www.buyma.com/buyer/2597809/sales_1.html

(screenshot: item page)

I extract the links and other fields for the 30 items on each order-results page as lists, then try to retrieve them one by one and store them in an item, as shown in the code below, but it does not work.

import csv

import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider

from researchtool.items import ResearchtoolItem  # project item class (import path assumed)


class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy2'
    allowed_domains = ['www.buyma.com']

    def start_requests(self):
        with open('/Users/morni/researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                # Order-results pages are numbered sales_1.html .. sales_299.html.
                for n in range(1, 300):
                    url = row[2][:-5] + '/sales_' + str(n) + '.html'
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        dont_filter=True,
                    )

    def parse_firstpage_item(self, response):
        # A single loader is created per response, not per product.
        loader = ItemLoader(item=ResearchtoolItem(), response=response)

        Conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_name = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()').getall()
        product_URL = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()

        for i in range(30):
            loader.add_value("Conversion_date", Conversion_date[i])
            loader.add_value("product_name", product_name[i])
            loader.add_value("product_URL", product_URL[i])

            yield loader.load_item()


The output is as follows: each yielded item contains the information for every product at once.

Current status: {"product_name": ["product1", "product2"], "Conversion_date": ["Conversion_date1", "Conversion_date2"], "product_URL": ["product_URL1", "product_URL2"]}

Ideal: [{"product_name": "product1", "Conversion_date": "Conversion_date1", "product_URL": "product_URL1"}, {"product_name": "product2", "Conversion_date": "Conversion_date2", "product_URL": "product_URL2"}]
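
For illustration, the ideal shape above can be produced from the three parallel lists with zip (a minimal sketch, assuming the lists line up one entry per product):

for date, name, url in zip(Conversion_date, product_name, product_URL):
    # Each iteration pairs up one product's fields and yields one dict.
    yield {"Conversion_date": date, "product_name": name, "product_URL": url}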

This may be due to my lack of understanding of basic for statements and yield.


Solution

  • You need to create a new loader on each iteration:

    for i in range(30):
        loader = ItemLoader(item=ResearchtoolItem(), response=response)
        loader.add_value("Conversion_date", Conversion_date[i])
        loader.add_value("product_name", product_name[i])
        loader.add_value("product_URL", product_URL[i])

        yield loader.load_item()

    EDIT:

    add_value appends the value to a list. Since the list starts with zero elements, after one append you'll have a list with one element, which is why each field comes out wrapped in a list.
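
    A minimal standalone snippet illustrating that behavior (DemoItem is a made-up item class, just for demonstration):

    import scrapy
    from scrapy.loader import ItemLoader


    class DemoItem(scrapy.Item):
        name = scrapy.Field()


    loader = ItemLoader(item=DemoItem())
    loader.add_value('name', 'product1')
    print(loader.load_item())  # {'name': ['product1']} (the single value is stored in a list)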

    In order to get the values as plain strings you can use an output processor. Example:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst  # in newer Scrapy: from itemloaders.processors import TakeFirst


    class ProductItem(scrapy.Item):
        # TakeFirst outputs the first non-null value instead of the whole list.
        name = scrapy.Field(output_processor=TakeFirst())
        price = scrapy.Field(output_processor=TakeFirst())


    class ExampleSpider(scrapy.Spider):
        name = 'exampleSpider'
        start_urls = ['https://scrapingclub.com/exercise/list_infinite_scroll/']

        def parse(self, response, **kwargs):
            names = response.xpath('//div[@class="card-body"]//h4/a/text()').getall()
            prices = response.xpath('//div[@class="card-body"]//h5//text()').getall()
            length = len(names)

            for i in range(length):
                # A fresh loader per product, so each yielded item holds a single product.
                loader = ItemLoader(item=ProductItem(), response=response)
                loader.add_value('name', names[i])
                loader.add_value('price', prices[i])

                yield loader.load_item()
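
    As a side note: instead of collecting three parallel lists and indexing into them, you can iterate over one row selector per product and use relative XPaths. This is a minimal sketch assuming the same page structure as the question's XPaths; it also avoids the hard-coded range(30) on pages with fewer products:

    def parse_firstpage_item(self, response):
        # One <ul> block per product; relative XPaths keep the fields aligned.
        for row in response.xpath('//*[@id="buyeritemtable"]/div/ul'):
            yield {
                'Conversion_date': row.xpath('./li[2]/p[3]/text()').get(),
                'product_name': row.xpath('./li[2]/p[1]/a/text()').get(),
                'product_URL': row.xpath('./li[2]/p[1]/a/@href').get(),
            }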