I'm looking for a key/value store in Python that is suitable for holding (and caching) HTTP responses (content, HTTP headers, timestamp), keyed by request URL. The application is a web scraping engine that queries several sites on a regular basis. A set of routines then analyses the scraped data.
The options I've investigated so far include:
Python shelve module (fast, but the data can't be distributed and only a single process can write)
MongoDB (relatively fast; so far the best fit for what I am looking for)
CouchDB (too slow for this application)
memcached (not suitable because the store is not persistent and cached data can't be replicated; correct me if I am wrong)
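For reference, the kind of record described above (content, headers, timestamp, keyed by URL) can be sketched with the standard-library shelve module. This is only a minimal illustration: the helper names `store_response` and `load_response`, the example URL, and the temporary file path are assumptions, not part of the scraping engine.

```python
import os
import shelve
import tempfile
import time


def store_response(db, url, content, headers):
    """Store one HTTP response under its request URL (hypothetical helper)."""
    db[url] = {
        "content": content,
        "headers": dict(headers),
        "timestamp": time.time(),
    }


def load_response(db, url):
    """Return the cached record for url, or None if it is not cached."""
    return db.get(url)


# Illustrative usage with a throwaway file; a real cache would use a fixed path.
path = os.path.join(tempfile.mkdtemp(), "http_cache")
with shelve.open(path) as db:
    store_response(db, "http://example.com/", b"<html></html>",
                   {"Content-Type": "text/html"})
    cached = load_response(db, "http://example.com/")
    missing = load_response(db, "http://example.com/not-cached")
```

shelve keys must be strings, which fits URL keys, and the values are pickled, so any picklable record layout works.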
Some performance results using real scraped data:
Python shelve: 3500 reads/second
CouchDB (couchdbkit): 33 reads/second
MongoDB (pymongo): 2300 reads/second
Redis: 1200 reads/second
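A reads/second figure like the ones above can be measured with a small timing loop. The sketch below is not the harness that produced these numbers; it assumes a hypothetical `reads_per_second` helper and uses an in-memory dict with synthetic data as a stand-in for the real store.

```python
import random
import time


def reads_per_second(get, keys, n_reads=10_000):
    """Time n_reads random lookups through `get` and return the read rate."""
    sample = random.choices(keys, k=n_reads)
    start = time.perf_counter()
    for key in sample:
        get(key)
    elapsed = time.perf_counter() - start
    return n_reads / elapsed


# Synthetic stand-in data; a real test would read scraped responses
# from shelve, MongoDB, CouchDB, or Redis instead of a dict.
store = {
    f"http://example.com/page/{i}": {"content": b"x" * 1024, "headers": {}}
    for i in range(1000)
}
rate = reads_per_second(store.__getitem__, list(store))
print(f"{rate:.0f} reads/second")
```

Passing the lookup as a callable keeps the loop identical across backends, so differences in the result come from the store rather than from the harness.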
I ended up using a capped collection in MongoDB. Each entry holds the URL (primary key), content, and headers. Since capped collections don't allow removal, the content is set to null to indicate that a cached entry has been removed.
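A rough sketch of that scheme with pymongo follows. The database name `scraper`, the collection name `http_cache`, the size limit, and all function names are assumptions made for illustration, and `open_cache` needs pymongo installed and a running MongoDB server. The URL is used as `_id` so that it serves as the primary key.

```python
import time


def open_cache(uri="mongodb://localhost:27017", size_bytes=1 << 30):
    """Return the capped cache collection, creating it if needed."""
    from pymongo import MongoClient  # imported here: needs pymongo and a server

    db = MongoClient(uri)["scraper"]  # assumed database name
    if "http_cache" not in db.list_collection_names():
        db.create_collection("http_cache", capped=True, size=size_bytes)
    return db["http_cache"]


def cache_put(cache, url, content, headers):
    """Insert a response; the URL serves as the primary key (_id)."""
    cache.insert_one({
        "_id": url,
        "content": content,
        "headers": headers,
        "timestamp": time.time(),
    })


def cache_get(cache, url):
    """Return the cached document, or None if absent or marked as removed."""
    doc = cache.find_one({"_id": url})
    if doc is None or doc["content"] is None:
        return None
    return doc


def cache_invalidate(cache, url):
    """Mark an entry as removed by setting its content to null (None)."""
    cache.update_one({"_id": url}, {"$set": {"content": None}})
```

Readers treat a null content the same as a missing entry, so the capped collection never needs an explicit delete; old entries are eventually overwritten once the size limit is reached.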