Search code examples
couchdbmapreducecouchapp

CouchDB: Return Newest Documents of Type Based on Timestamp


I have a system that accepts status updates from a variety of unique sources, and each status update creates a new document in the following structure:

{
 "type": "status_update",
 "source_id": "truck1231",
 "timestamp": 13023123123,
 "location": "Boise, ID"
}

Data purely example, but gets the idea across.

Now, these documents are generated at interval, once an hour or so. An hour later, we might the insert:

{
 "type": "status_update",
 "source_id": "truck1231",
 "timestamp": 13023126723,
 "location": "Madison, WI"
}

All I'm interested in doing is seeing the latest update from each unique source. I'm currently doing this by taking a map of:

function(doc) {
  if (doc.type == "status_update") {
    emit(doc.source_id, doc);
  }
}

And a reduce of:

function(keys, values, rereduce) {
  var winner = values[0];
  var i = values.length;
  while (i--) {
    var val = values[i];
    if (val.timestamp > winner.timestamp) winner = val;
  }
  return winner;
}

And querying the data as a reduce with group=true. This works as expected and provides a keyed result of only the latest updates.

The problem is it's terribly slow and requires me to reduce_limit=false in the CouchDB config.

It feels like there has to be a more efficient way of doing this. Updating the same document is not an option-- the history is important even though I don't require it in this instance. Processing the data client-side isn't an option either as this is a CouchApp and the amount of documents in the system is actually quite large and not practical to send them all over the wire.

Thanks in advance.


Solution

  • You can get the latest timestamp for every source using the _stats built-in reduce function, then do another query to get the documents. Here's the views:

    "views": {
      "latest_update": {
        "map": "function(doc) { if (doc.type == 'status_update') emit(doc.source_id, doc.timestamp); }",
        "reduce": "_stats"
      },
      "status_update": {
        "map": "function(doc) { if (doc.type == 'status_update') emit([doc.source_id, doc.timestamp], 1); }"
      }
    }
    

    First query latest_update with group=true, then status_update with something like (properly url-encoded):

    keys=[["truck123",TS123],["truck234",TS234],...]&include_docs=true
    

    where TS123, and TS234 are the values of max returned by latest_update.