Tags: python, mongodb, couchdb, distributed-caching

What would be the best way to persistently store HTTP responses for use in a web scraping application?


I'm looking for a key/value store in Python suitable for holding (and caching) HTTP responses (content, HTTP headers, timestamp) keyed by request URL. The application is a web scraping engine in which several sites are queried on a regular basis; a set of routines then analyses the scraped data. A sketch of the intended record layout follows.
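To make the data model concrete, one cache record could look like the sketch below. The make_cache_entry helper and the requests-style response object are illustrative assumptions, not part of any particular store:

    import time

    def make_cache_entry(url, response):
        # One record per request URL; `response` is assumed to be a
        # requests-style response object (hypothetical here).
        return {
            "url": url,                         # the lookup key
            "content": response.content,        # raw response body
            "headers": dict(response.headers),  # HTTP headers as a plain dict
            "timestamp": time.time(),           # fetch time, for staleness checks
        }

Whatever store is chosen then only has to map the URL to a record of this shape.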

The options I've investigated so far include:

  • python shelve module (fast, but data can't be distributed and only a single process can write at a time; see the sketch after this list)

  • mongodb (relatively fast, so far the best fit for what I am looking for)

  • couchdb (too slow for this application)

  • memcached (not suitable because the store is not persistent and cached data can't be replicated; correct me if I'm wrong)
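For reference, the shelve approach from the first bullet can be sketched as follows. The "responses.db" filename and the record layout are assumptions for illustration; shelve pickles the value transparently:

    import shelve
    import time

    # Single-process cache: shelve allows only one writer at a time.
    with shelve.open("responses.db") as cache:
        # Store a response keyed by its request URL.
        cache["http://example.com/"] = {
            "content": b"<html>...</html>",
            "headers": {"Content-Type": "text/html"},
            "timestamp": time.time(),
        }

        # Look the response up again.
        entry = cache.get("http://example.com/")
        if entry is not None:
            print(entry["headers"]["Content-Type"])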

Some performance results using real scraped data:

python shelve:          3500 reads/second
couchdb (couchdbkit):     33 reads/second
mongodb (pymongo):      2300 reads/second
redis:                  1200 reads/second

Solution

  • I ended up using a capped collection in MongoDB. Each entry holds the URL (primary key), the content, and the headers. Since capped collections don't allow removal, the content is set to null to indicate that a cached entry has been invalidated. A minimal sketch follows.
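Here is a minimal sketch of that scheme with pymongo. The database and collection names, the cap size, and the helper names are assumptions, and it presumes a MongoDB version (3.2+) that permits size-changing updates on capped collections:

    from pymongo import MongoClient

    client = MongoClient()
    db = client["scraper"]  # hypothetical database name

    # Create the capped collection once; the 512 MB cap is illustrative.
    if "http_cache" not in db.list_collection_names():
        db.create_collection("http_cache", capped=True, size=512 * 1024 * 1024)
    cache = db["http_cache"]

    def put(url, content, headers):
        # Capped collections never delete documents on demand; the oldest
        # entries are overwritten automatically once the size cap is reached.
        cache.insert_one({"url": url, "content": content, "headers": headers})

    def invalidate(url):
        # Deletes are not allowed in a capped collection, so "removal" is
        # modelled by nulling the content in place.
        cache.update_one({"url": url}, {"$set": {"content": None}})

    def get(url):
        doc = cache.find_one({"url": url})
        # Treat a nulled entry the same as a cache miss.
        if doc is not None and doc["content"] is not None:
            return doc
        return None

The automatic overwrite of the oldest documents is what makes a capped collection behave like a fixed-size cache without any explicit eviction code.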