When scraping, I need a field whose value is a combination of two different elements in the HTML. I was thinking about using Scrapy's ItemLoaders to get rid of the ugly code that this might otherwise produce. The elements can be reached with the following selectors:
main_element = response.css('css_to_main')
element_one = main_element.css('css_to_one::text').get()
element_two = main_element.css('css_to_two::text').get()
final_element = element_one + element_two  # (with some extra processing on both elements)
To achieve the desired effect, I start off by passing the main_element:
l = MyLoader(MyItem(), selector=response)
l.add_css('variable_name','css_to_main')
which then passes through the loader:
class MyLoader(ItemLoader):
    variable_name_in = Combine()
    variable_name_out = Identity()

class Combine:
    def __call__(self, values):
        main_element = values[0]
        first_element = main_element.css('span.css_to_one::text').get()
        second_element = main_element.css('span.css_to_two::text').get()
        return [first_element, second_element]
The idea is that it then gets passed to the item:
class MyItem(scrapy.Item):
    variable_name = scrapy.Field(
        input_processor=MapCompose(remove_tags, strip_content),
        output_processor=Join('')
    )
However, this method does not work. I can't figure out how the .add_css method passes the given value to the loader and on to the processors. Does anyone have an idea of how to construct this kind of processing for items in Scrapy?
Using item loaders is the right approach. Add the two selectors in sequence, then use an output processor to join the collected values. The default ItemLoader is enough for this:
from scrapy.loader import ItemLoader
from itemloaders.processors import Join
l = ItemLoader(MyItem(), response=response, selector=response.css('css_to_main'))
l.add_css('variable_name','css_to_one::text')
l.add_css('variable_name','css_to_two::text')
yield l.load_item()
In the item Field, you then process the values using input and output processors. I have omitted the input processor for simplicity, but you can add one as required.
class MyItem(scrapy.Item):
    variable_name = scrapy.Field(output_processor=Join(''))