python scrapy scrapyd

Scrapy: how to ignore items with blank fields using ItemLoader


I would like to know how to ignore (drop) items that don't have all of their fields filled in. In the scrapyd output I'm getting items from pages where some fields are empty.

I have this code:

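import scrapy
# In Scrapy 2.x these processors live in itemloaders.processors instead.
from scrapy.loader.processors import MapCompose, TakeFirst

# remove_entities and clear_price are helpers defined elsewhere in my project.

class Product(scrapy.Item):
    source_url = scrapy.Field(
        output_processor=TakeFirst()
    )
    name = scrapy.Field(
        input_processor=MapCompose(remove_entities),
        output_processor=TakeFirst()
    )
    initial_price = scrapy.Field(
        input_processor=MapCompose(remove_entities, clear_price),
        output_processor=TakeFirst()
    )
    main_image_url = scrapy.Field(
        output_processor=TakeFirst()
    )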

Parser:

def parse_page(self, response):
    try:
        l = ItemLoader(item=Product(), response=response)
        l.add_value('source_url', response.url)
        l.add_css('name', 'h1.title-product::text')
        l.add_css('main_image_url', 'div.pics a img.zoom::attr(src)')

        l.add_css('initial_price', 'ul.precos li.preco_normal::text')
        l.add_css('initial_price', 'ul.promocao li.preco_promocao::text')

        return l.load_item()

    except Exception as e:
        # Log the error together with the URL of the page that caused it
        self.log("#1 ERROR: %s (%s)" % (e, response.url))

I want to do this with the Loader, without having to create my own Selector (to avoid processing items twice). I guess I could drop them in a pipeline, but that is probably not the best way, because these items aren't valid.


Solution

  • Data validation is one of the typical use cases for pipelines. In your case you only need a small amount of code to check for the required fields, something along the lines of:

    from scrapy.exceptions import DropItem
    
    class YourPersonalPipeline(object):
        def process_item(self, item, spider):
            required_fields = [] # your list of required fields
            if all(field in item for field in required_fields):
                return item
            else:
                raise DropItem("your reason")
    

    You need to enable the pipeline in settings.py. Read more in the Scrapy docs.
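
As a minimal sketch of that last step, a pipeline is enabled by listing its import path in the ITEM_PIPELINES setting. The module path myproject.pipelines below is a hypothetical placeholder for wherever the class actually lives, and 300 is an arbitrary priority (Scrapy runs enabled pipelines in ascending order of this number, conventionally chosen in the 0-1000 range):

```python
# settings.py
# "myproject.pipelines" is an assumed module path; replace it with the
# module that really contains YourPersonalPipeline in your project.
ITEM_PIPELINES = {
    "myproject.pipelines.YourPersonalPipeline": 300,
}
```

Once the pipeline is enabled, any item for which process_item raises DropItem is discarded and is not passed on to later pipelines.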