Tags: scrapy, scrapyd

Where does scrapyd write crawl results when using an S3 FEED_URI, before uploading to S3?


I'm running a long-running web crawl using scrapyd and scrapy 1.0.3 on an Amazon EC2 instance. I'm exporting jsonlines files to S3 using these parameters in my spider/settings.py file:

FEED_FORMAT = 'jsonlines'
FEED_URI = 's3://my-bucket-name'

My scrapyd.conf file sets the items_dir property to empty:

items_dir=

The items_dir setting is left empty so that scrapyd does not override the FEED_URI setting in the spider's settings, which points to an S3 bucket (see Saving items from Scrapyd to Amazon S3 using Feed Exporter).

This works as expected in most cases, but on one particularly large crawl I'm running into a problem: the local disk (which isn't very big) fills up with the in-progress crawl's data before the crawl can complete, and thus before the results can be uploaded to S3.

Is there any way to configure where the "intermediate" results of this crawl are written before they are uploaded to S3? I'm assuming that Scrapy's in-progress crawl data isn't held entirely in RAM but is written to disk somewhere; if that's the case, I'd like to point that location at an external mount with enough space to hold the results before the completed .jl file is shipped to S3. Specifying a value for items_dir isn't an option, since that prevents scrapyd from automatically uploading the results to S3 on completion.


Solution

  • Scrapy's S3 feed storage (S3FeedStorage) inherits from BlockingFeedStorage, whose open() method returns a TemporaryFile(prefix='feed-') from Python's tempfile module that the feed exporter writes into

    The default directory for that temporary file is chosen by tempfile from the TMPDIR, TEMP and TMP environment variables, then from a platform-dependent list of fallback locations

    You can subclass S3FeedStorage and override its open() method to create the temporary file somewhere other than the default, for example by passing the dir argument of tempfile.TemporaryFile([mode='w+b'[, bufsize=-1[, suffix=''[, prefix='tmp'[, dir=None]]]]]) (see the sketch below the list)
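
As a minimal sketch (assuming Scrapy 1.0.x, where the storage class lives at scrapy.extensions.feedexport.S3FeedStorage, and a hypothetical mount point /mnt/scrapy-feeds with enough free space), the subclass could look like this:

from tempfile import TemporaryFile

from scrapy.extensions.feedexport import S3FeedStorage

class LargeDiskS3FeedStorage(S3FeedStorage):
    """Buffer the in-progress feed on a large volume instead of the default temp dir."""

    def open(self, spider):
        # Same behaviour as BlockingFeedStorage.open(), but with an explicit
        # dir= so the temporary feed file is created on the large mount.
        # '/mnt/scrapy-feeds' is an assumed path; point it at any directory
        # with enough free space.
        return TemporaryFile(prefix='feed-', dir='/mnt/scrapy-feeds')

You would then route s3:// feed URIs through this class with the FEED_STORAGES setting in settings.py (the module path below is hypothetical):

FEED_STORAGES = {
    's3': 'myproject.feedstorage.LargeDiskS3FeedStorage',
}

Alternatively, because the default location comes from Python's tempfile module, setting the TMPDIR environment variable to the large mount in the environment scrapyd runs under should redirect the temporary feed files without any code changes.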