Tags: python, web-scraping, scrapy, scrapy-pipeline

List elements retrieved by XPath in Scrapy are not output correctly item by item (for, yield)


I write the URL of the first order-results page of each exhibitor, extracted from a specific EC site, to a CSV file, read that file in start_requests, and loop over its rows with a for statement.

Each order-results page contains information on 30 products.

https://www.buyma.com/buyer/2597809/sales_1.html

(screenshot: item page)

I extract the links and other fields for the 30 items on each order-results page as lists, then try to retrieve them one by one and store them in an item, as shown in the code below, but it does not work.

import csv

import scrapy
from scrapy.loader import ItemLoader
from scrapy.spiders import CrawlSpider

from researchtool.items import ResearchtoolItem  # project item class (import path assumed)


class AllSaledataSpider(CrawlSpider):
    name = 'all_salesdata_copy2'
    allowed_domains = ['www.buyma.com']

    def start_requests(self):
        with open('/Users/morni/researchtool/AllshoppersURL.csv', 'r', encoding='utf-8-sig') as f:
            reader = csv.reader(f)
            for row in reader:
                # Order-results pages are numbered sales_1.html .. sales_299.html.
                for n in range(1, 300):
                    url = row[2][:-5] + '/sales_' + str(n) + '.html'
                    yield scrapy.Request(
                        url=url,
                        callback=self.parse_firstpage_item,
                        dont_filter=True,
                    )

    def parse_firstpage_item(self, response):
        # A single loader is created per response, not per product.
        loader = ItemLoader(item=ResearchtoolItem(), response=response)

        Conversion_date = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[3]/text()').getall()
        product_name = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/text()').getall()
        product_URL = response.xpath('//*[@id="buyeritemtable"]/div/ul/li[2]/p[1]/a/@href').getall()

        for i in range(30):
            loader.add_value("Conversion_date", Conversion_date[i])
            loader.add_value("product_name", product_name[i])
            loader.add_value("product_URL", product_URL[i])

            yield loader.load_item()


The output is as follows: each yielded item contains the information for every product at once.

Current status: {"product_name": ["product1", "product2"], "Conversion_date": ["Conversion_date1", "Conversion_date2"], "product_URL": ["product_URL1", "product_URL2"]}

Ideal: [{"product_name": "product1", "Conversion_date": "Conversion_date1", "product_URL": "product_URL1"}, {"product_name": "product2", "Conversion_date": "Conversion_date2", "product_URL": "product_URL2"}]
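
For illustration, the ideal shape above can be produced from the three parallel lists with zip (a minimal sketch, assuming the lists line up one entry per product):

for date, name, url in zip(Conversion_date, product_name, product_URL):
    # Each iteration pairs up one product's fields and yields one dict.
    yield {"Conversion_date": date, "product_name": name, "product_URL": url}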

This may be due to my lack of understanding of basic for statements and yield.


Solution

  • You need to create a new loader on each iteration:

    for i in range(30):
        loader = ItemLoader(item=ResearchtoolItem(), response=response)
        loader.add_value("Conversion_date", Conversion_date[i])
        loader.add_value("product_name", product_name[i])
        loader.add_value("product_URL", product_URL[i])

        yield loader.load_item()

    EDIT:

    add_value appends the value to a list. Since the list starts with zero elements, after one append you'll have a list with one element, which is why each field comes out wrapped in a list.
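
    A minimal standalone snippet illustrating that behavior (DemoItem is a made-up item class, just for demonstration):

    import scrapy
    from scrapy.loader import ItemLoader


    class DemoItem(scrapy.Item):
        name = scrapy.Field()


    loader = ItemLoader(item=DemoItem())
    loader.add_value('name', 'product1')
    print(loader.load_item())  # {'name': ['product1']} (the single value is stored in a list)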

    In order to get the values as plain strings you can use an output processor. Example:

    import scrapy
    from scrapy.loader import ItemLoader
    from scrapy.loader.processors import TakeFirst  # in newer Scrapy: from itemloaders.processors import TakeFirst


    class ProductItem(scrapy.Item):
        # TakeFirst outputs the first non-null value instead of the whole list.
        name = scrapy.Field(output_processor=TakeFirst())
        price = scrapy.Field(output_processor=TakeFirst())


    class ExampleSpider(scrapy.Spider):
        name = 'exampleSpider'
        start_urls = ['https://scrapingclub.com/exercise/list_infinite_scroll/']

        def parse(self, response, **kwargs):
            names = response.xpath('//div[@class="card-body"]//h4/a/text()').getall()
            prices = response.xpath('//div[@class="card-body"]//h5//text()').getall()
            length = len(names)

            for i in range(length):
                # A fresh loader per product, so each yielded item holds a single product.
                loader = ItemLoader(item=ProductItem(), response=response)
                loader.add_value('name', names[i])
                loader.add_value('price', prices[i])

                yield loader.load_item()
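
    As a side note: instead of collecting three parallel lists and indexing into them, you can iterate over one row selector per product and use relative XPaths. This is a minimal sketch assuming the same page structure as the question's XPaths; it also avoids the hard-coded range(30) on pages with fewer products:

    def parse_firstpage_item(self, response):
        # One <ul> block per product; relative XPaths keep the fields aligned.
        for row in response.xpath('//*[@id="buyeritemtable"]/div/ul'):
            yield {
                'Conversion_date': row.xpath('./li[2]/p[3]/text()').get(),
                'product_name': row.xpath('./li[2]/p[1]/a/text()').get(),
                'product_URL': row.xpath('./li[2]/p[1]/a/@href').get(),
            }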