couchdb, couchbase, cloudant, touchdb, iriscouch

The pricing programs of hosted CouchDB providers do not make sense


The holy grail of CouchDB is its replication feature. With TouchDB, Cloudant-Sync and Couchbase-Lite you can even replicate a database to and from users' smartphones, so the data stays available even when there are connectivity problems.

The CouchDB replication protocol (which may be implemented slightly differently across frameworks and SDKs) makes a GET request for every document that has changed.

Both Cloudant and Iris-Couch provide pricing programs that are based on the size of the database, the number of light HTTP requests (GET, HEAD) and the number of heavy HTTP requests (PUT, POST, DELETE). This means that a GET for a single document costs the same as a GET to /_all_docs.

In some sense, the replication protocol looks very inefficient under these pricing programs. For example, if your users only pull documents from the server, it may be cheaper to call /_all_docs?include_docs=true than to run a standard replication, even though the /_all_docs request makes you download documents that did not change.
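To make the comparison concrete, here is a rough sketch of the request counts involved. It assumes a simplified model in which a pull replication costs one change-feed request plus one GET per changed document, and it ignores checkpoint reads and writes that real replicators also perform. The function names and the per-request price are hypothetical, not taken from any provider's price list.

```python
# Simplified model, for illustration only: real replicators issue extra
# requests (checkpoints, revision diffs) that are not counted here.

def replication_requests(changed_docs: int) -> int:
    """One request for the list of changes plus one GET per changed document."""
    return 1 + changed_docs


def all_docs_requests() -> int:
    """A single GET /_all_docs?include_docs=true, whatever the database size."""
    return 1


def billed_cost(requests: int, price_per_request: float) -> float:
    """Cost under a purely per-request pricing program (hypothetical price)."""
    return requests * price_per_request


if __name__ == "__main__":
    changed = 1000
    price = 0.00002  # hypothetical price of one light request, in dollars
    print("replication:", replication_requests(changed), "requests,",
          billed_cost(replication_requests(changed), price), "dollars")
    print("_all_docs:  ", all_docs_requests(), "request,",
          billed_cost(all_docs_requests(), price), "dollars")
```

Under this model the per-request bill grows with the number of changed documents, while the /_all_docs bill stays flat and only the bandwidth grows.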

Am I missing something? Shouldn't the pricing programs consider the amount of data being downloaded or uploaded instead of the number of requests? Shouldn't a GET request for a single document be much cheaper than calling /_all_docs or a view? Could the replication protocol be tweaked so that it is less efficient in terms of bandwidth but much cheaper?

P.S. I know that Couchbase is a separate project and that the CouchDB replication protocol is irrelevant to it. Couchbase also supports replication to and from clients (via Couchbase Lite). Is there any way to compare the two mechanisms in terms of the number of requests made to the server?

--- EDIT ---

It looks like /_all_docs is being used in the Couchbase-Lite replication algorithm, not to reduce the cost but to optimize the process: https://github.com/couchbase/couchbase-lite-ios/wiki/Replication-Algorithm

  • A limited case of the above-mentioned bulk-get optimization is possible with the standard API: revisions of generation 1 (revision ID starts with “1-”) can be fetched in bulk via _all_docs, because by definition they have no revision histories. Unfortunately _all_docs can’t include attachment bodies, so if it returns a document whose JSON indicates it has attachments, those will have to be fetched separately. Nonetheless, this optimization can help significantly, and is currently implemented in Couchbase Lite.
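The optimization quoted above can be sketched roughly as follows. This is not Couchbase Lite's code; it is a minimal illustration that assumes the changes arrive as (document ID, revision ID) pairs. The helper names are hypothetical. The `keys` body is the standard way to ask _all_docs for specific documents, and the request would be sent with include_docs=true.

```python
import json


def split_by_generation(changes):
    """Separate generation-1 revisions (bulk-fetchable) from the rest.

    `changes` is a list of (doc_id, rev_id) pairs, e.g. ("a", "1-abc").
    The generation is the integer before the first "-" in the revision ID.
    """
    bulk_ids, individual = [], []
    for doc_id, rev_id in changes:
        generation = int(rev_id.split("-", 1)[0])
        if generation == 1:
            bulk_ids.append(doc_id)               # no revision history to fetch
        else:
            individual.append((doc_id, rev_id))   # needs its own GET
    return bulk_ids, individual


def all_docs_body(doc_ids):
    """JSON body for POST /<db>/_all_docs?include_docs=true."""
    return json.dumps({"keys": doc_ids})


if __name__ == "__main__":
    changes = [("a", "1-abc"), ("b", "2-def"), ("c", "1-xyz"), ("d", "10-aaa")]
    bulk, single = split_by_generation(changes)
    print(all_docs_body(bulk))
    print(len(single), "documents still need individual GETs")
```

Parsing the generation as an integer, instead of testing for a leading "1", keeps revisions such as "10-aaa" out of the bulk group.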

--- EDIT ---

This issue is being handled in Couchbase Sync Gateway, not as a part of CouchDB: https://github.com/couchbase/sync_gateway/wiki/Bulk-GET

I wonder if this is ever going to be implemented in CouchDB. It looks like service providers that charge per request have no incentive to support this feature...
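For a sense of what a bulk GET changes in terms of billing, here is a small sketch that groups the revisions to fetch into batches and builds one request body per batch. The body shape (a "docs" array of id/rev objects) is my reading of the linked wiki page and should be treated as an assumption, as should the batch size and the function names.

```python
def bulk_get_bodies(revisions, batch_size=100):
    """Build one request body per batch of (doc_id, rev_id) pairs.

    Assumed body shape: {"docs": [{"id": ..., "rev": ...}, ...]}.
    `batch_size` is an arbitrary illustrative value.
    """
    bodies = []
    for start in range(0, len(revisions), batch_size):
        batch = revisions[start:start + batch_size]
        bodies.append({"docs": [{"id": d, "rev": r} for d, r in batch]})
    return bodies


if __name__ == "__main__":
    revs = [(f"doc{i}", f"2-rev{i}") for i in range(250)]
    bodies = bulk_get_bodies(revs)
    # 250 individual GETs collapse into 3 billed requests.
    print(len(revs), "revisions ->", len(bodies), "requests")
```

With per-request pricing, the number of billed requests drops from one per changed document to one per batch, without downloading unchanged documents.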


Solution

  • You have a point, but then again it does not matter.

    Why you have a point

    Indeed, a /_all_docs call is billed as a single request even though it returns all of your documents. You just found a way to cheat your host into giving you a 'free service'.

    Why it does not matter

    • Replication needs to be efficient, so you really don't want the slave couch to check every document that may have been updated against _all_docs on the master. Even if you wanted to do that, replications have to run often enough to retain reasonable consistency, so each one is likely to see only a small amount of change. If 1 in 1,000 documents gets updated between two replications, the overhead cost of replicating document by document is pretty small.

    • Assume you run a blog or application that queries _all_docs to minimize the number of requests. Well done: if your application is meant to be responsive and you need 5 kB of documents out of a 50 MB database, you just lost a whole lot of users, because every request downloads the entire database and the application will be as unresponsive as anything.

    • You are optimizing at the wrong end. You will typically hit a $20 limit at around 1 million GET requests. If you have a website with that level of traffic and you run ads on it, you'll likely earn $500 or more (assuming an eCPM of $0.50). You are much more likely to improve your bottom line by adding content than by squeezing the cost of your CouchDB.
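The arithmetic behind the first and last points can be checked with a few lines. The figures ($20 per million GET requests, an eCPM of $0.50, 1 in 1,000 documents changed) come from the points above. The function names, the assumption of one GET per page view, and the example database size are hypothetical.

```python
def hosting_cost(get_requests, dollars_per_million=20.0):
    """Request charges, assuming roughly $20 per million GET requests."""
    return get_requests / 1_000_000 * dollars_per_million


def ad_revenue(page_views, ecpm=0.5):
    """Ad revenue, where eCPM is the revenue per 1,000 page views."""
    return page_views / 1000 * ecpm


def per_document_requests(total_docs, one_in=1000):
    """Requests for one replication if 1 in `one_in` documents changed:
    one change-feed request plus one GET per changed document."""
    return 1 + total_docs // one_in


if __name__ == "__main__":
    views = 1_000_000  # assumes one GET request per page view
    print("hosting cost:", hosting_cost(views))   # 20.0
    print("ad revenue:  ", ad_revenue(views))     # 500.0
    print("requests per replication of 100,000 docs:",
          per_document_requests(100_000))         # 101
```

On these numbers the request charges are a small fraction of the revenue the same traffic would bring in, which is the point of the last bullet.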