
Disable Scrapyd item storing in .jl feed


Question

I want to know how to disable item storing in Scrapyd.

What I tried

I deploy a spider to the Scrapy daemon Scrapyd. The deployed spider stores the scraped data in a database, and that part works fine.

However, Scrapyd also records each scraped item. You can see this in the Scrapyd web interface: the item data is stored in ..../items/<project name>/<spider name>/<job name>.jl

I have no clue how to disable this. I run Scrapyd in a Docker container, and these item feeds use far too much storage.

I have tried the approach from suppress Scrapy Item printed in logs after pipeline, but it seems to do nothing for Scrapyd; all spider logging settings appear to be ignored by Scrapyd.

Edit: I found this entry in the documentation about item storing. It seems that if you omit the items_dir setting, item storing will not happen; the documentation says it is disabled by default. I do not have a scrapyd.conf file, so item storing should be disabled. It is not.
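For reference, here is a minimal scrapyd.conf sketch (following the [scrapyd] section layout from the Scrapyd documentation) that would disable item storing explicitly. Note that some Scrapyd packages and Docker images ship a default configuration in which items_dir is set to a non-empty value, which would explain this behaviour:

    [scrapyd]
    # An empty items_dir means scraped items are not written to .jl feed files.
    items_dir =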


Solution

  • After writing my answer I re-read your question and saw that what you want has nothing to do with logging; it's about not writing to the (default-ish) .jl feed (maybe update the title to "Disable Scrapyd item storing"). To override Scrapyd's default, just set FEED_URI to an empty string like this:

    $ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI=
    
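    If you'd rather schedule from Python than from curl, here's a minimal sketch using the requests library; it assumes the same Scrapyd instance on localhost:6800 and the same tutorial/example project as the curl call above:

    import requests

    # Same call as the curl above: schedule the spider with FEED_URI set to
    # an empty string so Scrapyd does not write items to its .jl feed.
    response = requests.post(
        "http://localhost:6800/schedule.json",
        data={"project": "tutorial", "spider": "example", "setting": "FEED_URI="},
    )
    print(response.json())  # expect something like {"status": "ok", "jobid": "..."}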

    For other people who are looking into logging... Let's see an example. We do the usual:

    $ scrapy startproject tutorial
    $ cd tutorial
    $ scrapy genspider example example.com
    

    then edit tutorial/spiders/example.py to contain the following:

    import scrapy
    
    class TutorialItem(scrapy.Item):
        name = scrapy.Field()
        surname = scrapy.Field()
    
    class ExampleSpider(scrapy.Spider):
        name = "example"
    
        start_urls = (
            'http://www.example.com/',
        )
    
        def parse(self, response):
            # Yield a batch of dummy items so the item log lines are easy to observe.
            for i in range(100):  # range, not xrange: modern Scrapy runs on Python 3
                t = TutorialItem()
                t['name'] = "foo"
                t['surname'] = "bar %d" % i
                yield t
    

    Notice the difference between running:

    $ scrapy crawl example
    # or
    $ scrapy crawl example -L DEBUG
    # or
    $ scrapy crawl example -s LOG_LEVEL=DEBUG
    

    and

    $ scrapy crawl example -s LOG_LEVEL=INFO
    # or
    $ scrapy crawl example -L INFO
    

    Try such combinations on your spider to confirm that item data is not printed at any log level above DEBUG.

    It's now time, after you deploy to Scrapyd, to do exactly the same:

    $ curl http://localhost:6800/schedule.json -d setting=LOG_LEVEL=INFO -d project=tutorial -d spider=example
    

    Confirm that the job's log in the Scrapyd web interface no longer contains item data.

    Note that if your items are still printed at INFO level, it likely means that your own code or some pipeline is printing them. You could raise the log level further and/or investigate to find the code that prints them and remove it.
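    If you'd rather silence the per-item "Scraped from ..." lines at the source than rely on log levels, a custom log formatter can drop them entirely. This is only a sketch, assuming Scrapy 2.0+ (where LogFormatter methods may return None to skip a message); the module path tutorial/logformatter.py is just an example:

    # tutorial/logformatter.py (example location)
    from scrapy import logformatter

    class QuietItemLogFormatter(logformatter.LogFormatter):
        def scraped(self, item, response, spider):
            # Returning None drops the "Scraped from <response>" log entry
            # no matter what log level is configured.
            return None

    Then enable it in settings.py with LOG_FORMATTER = 'tutorial.logformatter.QuietItemLogFormatter'.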