I followed the tutorial at https://scrapeops.io/python-scrapy-playbook/scrapy-beginners-guide-cleaning-data/ to set up Scrapy itemloaders. However, I don't understand how to modify the itemloader so that it returns all elements of the list rather than just the first element.
I am able to get the necessary data with the following code:
print(''.join(data.xpath(".//text()").extract()))
Printing the data without using the itemloader or ''.join gives: ['A Long-term Study for Participants Previously Treated With ', 'Ciltacabtagene', ' Autoleucel']
The itemloader gives: A Long-term Study for Participants Previously Treated With
The print statement above (with ''.join) returns: A Long-term Study for Participants Previously Treated With Ciltacabtagene Autoleucel
itemloaders.py
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader
class DataLoader(ItemLoader):
    default_output_processor = TakeFirst()
    title_in = MapCompose(lambda x: x)
How do I modify itemloaders.py to return the necessary data?
Your output is a single item because of the line default_output_processor = TakeFirst(), which, as the name suggests, picks the first non-empty item from the output list. The first option is to use a different default output processor, such as Join(), Identity(), or a self-defined function, depending on your use case. The second option is to define an appropriate output processor for the specific field. For example, to join the output of the title extractor with a space using Join(), you could define an output processor for title as below (take note of the naming convention <field_name>_out).
from itemloaders.processors import TakeFirst, MapCompose, Join
from scrapy.loader import ItemLoader
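class DataLoader(ItemLoader):
    # Every field without its own output processor still keeps only its first value.
    default_output_processor = TakeFirst()
    title_in = MapCompose(lambda x: x)
    # Field-specific output processor: overrides the default for title.
    # Join() uses a single space as its default separator.
    title_out = Join()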
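To see what each choice does to the three text fragments from the question, here is a minimal sketch in plain Python. The helpers take_first and join are hypothetical stand-ins written for illustration only; they mimic what TakeFirst() and Join(separator) do, so the snippet runs without Scrapy installed. One thing it shows: because the fragments already carry their own leading and trailing spaces, joining with the default single space doubles the gaps, so Join('') or stripping each fragment in the input processor (for example MapCompose(str.strip)) may be closer to what you want.

```python
# The three fragments that .//text() extracts in the question.
values = [
    "A Long-term Study for Participants Previously Treated With ",
    "Ciltacabtagene",
    " Autoleucel",
]


def take_first(values):
    """Hypothetical stand-in for TakeFirst(): first value that is not None or ''."""
    for value in values:
        if value is not None and value != "":
            return value


def join(values, separator=" "):
    """Hypothetical stand-in for Join(separator): separator.join(values)."""
    return separator.join(values)


# What the loader returned in the question: only the first fragment.
first = take_first(values)

# Like Join(): the fragments keep their own spaces, so the gaps double up.
joined_default = join(values)

# Like Join(''): equivalent to the ''.join(...) call in the question.
joined_empty = join(values, "")

# Like MapCompose(str.strip) as input processor followed by Join().
joined_stripped = join([v.strip() for v in values])

print(repr(first))
print(repr(joined_default))
print(repr(joined_empty))
print(repr(joined_stripped))
```

In the loader itself, that would correspond to title_out = Join('') or to title_in = MapCompose(str.strip) combined with title_out = Join(), assuming the extracted fragments look like the ones above.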