Based on the research I've done, I suspect that a key-value store is NOT the way to go, but I wanted to get more directed input.
I have an application that consists of many "documents". These are currently being stored in a sort of CMIS repository. The application, however, only ever interacts with these documents after they've been indexed into elasticsearch. This means that ALL read operations will hit elasticsearch, and all write operations will update both elasticsearch and the repository.
Requested features have revealed that the current repository is much too strict and that there's zero reason to enforce a model schema at that level. This, of course, has led to an investigation into NoSQL options.
In order to populate these "documents" into the elasticsearch index, they need to live somewhere and I must be able to get all and paginate through them as they load into the index (there's also some aggregation that occurs at this step in order to populate fields that are built off of existing fields).
Right now, the get-all is done in stages based on the type of document. This requirement may be negotiable: a plain get-all across all types could suffice, but it would not be ideal.
In my understanding of key-value stores, the store knows nothing about the values it stores, and they can only be referenced by a key. This causes me to wonder if I could even perform a get all when I don't plan on maintaining a full list of the keys anywhere. I've seen that some key-value stores support using dictionaries as the key (redis). I'm not sure if this means I could query by type (if it were an entry in the dictionary) or if I would need to know the full dictionary to be able to fetch the value?
Since the population of the index should only need to happen if there was an elasticsearch failure, performance is not my top priority (but it certainly would not hurt). To me, MongoDB seems to be a near perfect fit. I can store documents and easily query by type.
In case it matters, for document stores I've been comparing CouchDB, Couchbase, and MongoDB. For key-value stores I've been looking at Redis and BerkeleyDB.
In Redis you can get all the keys and values with a bit of work: iterate over the keys with SCAN, then fetch each value with GET (or MGET for a batch of keys).
The SCAN command is also conveniently exposed as 'redis-cli --scan', which dumps every key, and it is wrapped by many client libraries (e.g. Python's).
You might need to write something to get this to work for your particular scenario, but hopefully that shouldn't be too difficult.
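As a rough illustration of the SCAN-based approach, here is a minimal Python sketch. It assumes a client object with redis-py style scan(cursor, match, count) and mget(keys) methods, and it assumes every document is stored as a plain string value (e.g. serialized JSON), since MGET returns nothing for other value types. The function name iter_documents and the key layout "doc:<type>:<id>" are hypothetical, chosen only to show how a key prefix could stand in for "query by type".

```python
def iter_documents(client, match="*", count=100):
    """Yield (key, value) pairs for every key matching `match`.

    `client` is assumed to behave like a redis-py client:
    scan() returns (next_cursor, keys) and mget() returns the values
    in the same order as the keys it was given.
    """
    cursor = 0
    while True:
        # SCAN walks the keyspace incrementally instead of blocking
        # the server; `count` is only a hint for the batch size.
        cursor, keys = client.scan(cursor=cursor, match=match, count=count)
        if keys:
            for key, value in zip(keys, client.mget(keys)):
                # A key may have been deleted (or hold a non-string
                # value) between SCAN and MGET; skip those.
                if value is not None:
                    yield key, value
        # A cursor of 0 means the iteration is complete.
        if cursor == 0:
            break
```

With redis-py this might be called as iter_documents(redis.Redis(), match="doc:invoice:*") to walk one hypothetical document type, or with the default match to walk everything. SCAN can return the same key more than once during a full iteration, so whatever consumes the pairs (for example, the Elasticsearch indexing step) should tolerate duplicates.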
NB: there is also a KEYS command, which does a similar thing to SCAN but is not recommended for live production use because it blocks the server while it walks the whole keyspace. That said, nothing stops you from building a separate, independent slave instance: replicate from the master, disconnect it from the master, and then use the slave as you wish without any impact on anything serving live traffic.