
How to upload crawled data from Scrapy to Amazon S3 as csv or json?


What are the steps to upload crawled data from Scrapy to Amazon S3 as a CSV/JSONL/JSON file? All I could find on the internet was how to upload scraped images to an S3 bucket.

I'm currently using Ubuntu 16.04, and I have installed boto with the command:

pip install boto

I have added the following lines to settings.py. Can anyone explain the other changes I have to make?

AWS_ACCESS_KEY_ID = 'access key id'
AWS_SECRET_ACCESS_KEY = 'access key'

FEED_URI = 'bucket path'
FEED_FORMAT = 'jsonlines'
FEED_EXPORT_FIELDS = None
FEED_STORE_EMPTY = False
FEED_STORAGES = {}
FEED_STORAGES_BASE = {
    '': None,
    'file': None,
    'stdout': None,
    's3': 'scrapy.extensions.feedexport.S3FeedStorage',
    'ftp': None,
}
FEED_EXPORTERS = {}
FEED_EXPORTERS_BASE = {
    'json': None,
    'jsonlines': None,
    'jl': None,
    'csv': None,
    'xml': None,
    'marshal': None,
    'pickle': None,
}
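
For reference, a minimal S3 feed-export setup is sketched below. It assumes the stock FEED_STORAGES_BASE mapping of the s3 scheme to S3FeedStorage (shown above) is left in place, and the bucket name and key path are placeholders, not values from this question:

    # settings.py -- minimal sketch; 'mybucket' and the key path are placeholders
    AWS_ACCESS_KEY_ID = 'access key id'
    AWS_SECRET_ACCESS_KEY = 'access key'

    # A full s3:// URI: scheme, bucket name, then the key path inside the bucket
    FEED_URI = 's3://mybucket/scrapy-output/items.jl'
    FEED_FORMAT = 'jsonlines'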

Edit 1: When I configure all of the above and run scrapy crawl spider, I get the following error after the crawled results:

2016-08-08 10:57:03 [scrapy] ERROR: Error storing csv feed (200 items) in: s3: myBucket/crawl.csv
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 246, in inContext
    result = inContext.theWork()
  File "/usr/lib/python2.7/dist-packages/twisted/python/threadpool.py", line 262, in <lambda>
    inContext.theWork = lambda: context.call(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 118, in callWithContext
    return self.currentContext().callWithContext(ctx, func, *args, **kw)
  File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
    return func(*args,**kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/extensions/feedexport.py", line 123, in _store_in_thread
    key.set_contents_from_file(file)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 1293, in set_contents_from_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 750, in send_file
    chunked_transfer=chunked_transfer, size=size)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/key.py", line 951, in _send_file_internal
    query_args=query_args
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 656, in make_request
    auth_path = self.calling_format.build_auth_path(bucket, key)
  File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 94, in build_auth_path
    path = '/' + bucket
TypeError: cannot concatenate 'str' and 'NoneType' objects
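
Judging by the last two frames (path = '/' + bucket failing because bucket is None), boto never received a bucket name, which typically happens when the feed URI is not a complete s3:// URL. A well-formed URI for the CSV feed above would look something like this sketch (the bucket name is taken from the error message and is otherwise a placeholder):

    FEED_URI = 's3://myBucket/crawl.csv'
    FEED_FORMAT = 'csv'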

Solution

  • The problem was solved by adding the following lines to the settings.py file:

    ITEM_PIPELINES = {
        'scrapy.pipelines.files.S3FilesStore': 1,
    }
    

    along with the S3 credentials mentioned earlier and a full s3:// feed URI:

    AWS_ACCESS_KEY_ID = 'access key id'
    AWS_SECRET_ACCESS_KEY = 'access key'

    FEED_URI = 's3://bucket/folder/filename.json'

    Thank you guys for your guidance.
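
    For anyone who wants to confirm the feed actually reached the bucket, a quick check with boto (the same library the traceback above goes through) might look like the sketch below; the bucket and key names are the placeholders from the solution, not real values:

    # check_feed.py -- minimal sketch using boto 2; bucket/key names are placeholders
    import boto

    conn = boto.connect_s3('access key id', 'access key')
    bucket = conn.get_bucket('bucket')
    key = bucket.get_key('folder/filename.json')

    if key is None:
        print('feed not found in the bucket')
    else:
        print('%s (%d bytes)' % (key.name, key.size))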