Question
I want to know how to disable Item storing in scrapyd.
What I tried
I deploy a spider to the Scrapy daemon Scrapyd. The deployed spider stores the spidered data in a database, and this works fine.
However, Scrapyd also stores each scraped Scrapy item. You can see this when examining the Scrapyd web interface.
This item data is stored in ..../items/<project name>/<spider name>/<job name>.jl
I have no clue how to disable this. I run scrapyd in a Docker container and it uses way too much storage.
I have tried the approach from "suppress Scrapy Item printed in logs after pipeline", but it seems to do nothing for scrapyd's logging. All spider logging settings seem to be ignored by scrapyd.
Edit
I found this entry in the documentation about Item storing. It seems that if you omit the items_dir setting, item storing will not happen. The documentation says this is disabled by default. I do not have a scrapyd.conf file, so item storing should be disabled, but it is not.
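One thing worth trying is creating a scrapyd.conf that sets items_dir to an empty value explicitly, rather than relying on the default. A minimal sketch based on scrapyd's documented config format (all other options left at their defaults):

```ini
[scrapyd]
# An empty items_dir tells scrapyd not to write scraped items
# to .jl feed files on disk.
items_dir =
```

scrapyd looks for this file in locations such as /etc/scrapyd/scrapyd.conf or a scrapyd.conf in the directory it is started from.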
After writing my answer I re-read your question, and I see that what you want has nothing to do with logging; it's about not writing to the (default-ish) .jl feed (maybe update the title to "Disable scrapyd Item storing"). To override scrapyd's default, just set FEED_URI to an empty string like this:
$ curl http://localhost:6800/schedule.json -d project=tutorial -d spider=example -d setting=FEED_URI=
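If you schedule jobs programmatically rather than with curl, the same override can be sent as POST data to scrapyd's schedule.json endpoint. A minimal sketch; the helper name build_schedule_payload is my own, and the project/spider names are the tutorial placeholders:

```python
def build_schedule_payload(project, spider, settings=None):
    """Build the POST payload for scrapyd's schedule.json endpoint.

    Scrapy settings are passed as repeated "setting" fields in
    KEY=VALUE form; an empty value (e.g. FEED_URI=) clears the setting.
    """
    payload = {"project": project, "spider": spider}
    for key, value in (settings or {}).items():
        payload.setdefault("setting", []).append(f"{key}={value}")
    return payload

# Override scrapyd's default feed so no .jl item file is written.
payload = build_schedule_payload("tutorial", "example", {"FEED_URI": ""})
# Then POST it, e.g.:
#   requests.post("http://localhost:6800/schedule.json", data=payload)
```

With requests, a list value for "setting" is sent as repeated form fields, which is how scrapyd expects multiple setting overrides.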
For other people who are looking into logging... Let's see an example. We do the usual:
$ scrapy startproject tutorial
$ cd tutorial
$ scrapy genspider example example.com
then edit tutorial/spiders/example.py
to contain the following:
import scrapy


class TutorialItem(scrapy.Item):
    name = scrapy.Field()
    surname = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = (
        'http://www.example.com/',
    )

    def parse(self, response):
        for i in range(100):
            t = TutorialItem()
            t['name'] = "foo"
            t['surname'] = "bar %d" % i
            yield t
Notice the difference between running:
$ scrapy crawl example
# or
$ scrapy crawl example -L DEBUG
# or
$ scrapy crawl example -s LOG_LEVEL=DEBUG
and
$ scrapy crawl example -s LOG_LEVEL=INFO
# or
$ scrapy crawl example -L INFO
By trying such combinations on your spider, confirm that it doesn't print item data at log levels above DEBUG.
Now, after you deploy to scrapyd, it's time to do exactly the same:
$ curl http://localhost:6800/schedule.json -d setting=LOG_LEVEL=INFO -d project=tutorial -d spider=example
Confirm that the job's logs no longer contain item data.
Note that if your items are still printed at the INFO level, it likely means that your own code or some pipeline is printing them. You could raise the log level further, and/or investigate, find the code that prints them, and remove it.
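A common culprit is a pipeline that logs every item at INFO. One fix is demoting such messages to DEBUG, so they disappear once LOG_LEVEL=INFO is set. A sketch; the class name QuietPipeline is illustrative, and spider.logger is the standard logger Scrapy attaches to every spider:

```python
class QuietPipeline:
    """Pipeline that logs items at DEBUG instead of INFO.

    DEBUG messages are suppressed when the job runs with
    LOG_LEVEL=INFO, so item data stays out of the scrapyd logs.
    """

    def process_item(self, item, spider):
        # Use the spider's own logger; with LOG_LEVEL=INFO this
        # line produces no log output.
        spider.logger.debug("processed item: %r", item)
        return item
```

Enable it as usual via ITEM_PIPELINES in the project settings.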