Search code examples
arangodbaql

arangodb get document update date from commit log


Is it possible to obtain a records updated data from Arangodb commit logs if there is such a thing as commit logs. We have a couple of documents which where update but we did update their modified date field. We however would like to retrieve all updated/changed documents since a certain date.


Solution

  • There are two solutions for this:

    Solution one:

    The first solution is to not use the commit log, but run an AQL query on the collection and filter on the modified date field. This will be efficient if there is a sorted index (i.e. skiplist index) on the modified field.

    An example setup for this can be found in the following setup script, which populates a collection test with 50K documents with random modification dates:

    /* use some fixed base date to make query produce results */
    var baseDate = 1478779081650; /* 2016-11-10T11:58:01.650Z */
    db._create("test");
    db.test.ensureIndex({ type: "skiplist", fields: [ "modified" ]});
    
    /* create 50,000 documents with modified dates between
       2016-11-10T11:58:01.650Z and up to two years in the past */
    for (var i = 0; i < 50000; ++i) {
      db.test.insert({ value: i, modified: new Date(baseDate - Math.floor(Math.random() * 1000 * 60 * 60 * 24 * 365 * 2)).toISOString() });
    }
    

    Then using AQL it's straight-forward to find documents with a modified date higher than a specific value:

    var query = "FOR doc IN test FILTER doc.modified >= @date RETURN doc"; 
    /* find all documents modified since 2016-11-09T12:00:00.000Z */
    var docs = db._query(query, { date: "2016-11-09T12:00:00.000Z" }).toArray();
    require("internal").print(docs);
    

    It's also possible to do queries on date ranges, e.g.

    var query = "FOR doc IN test FILTER doc.modified >= @from && doc.modified <= @to RETURN doc"; 
    var docs = db._query(query, { from: "2016-11-09T00:00:00.000Z", to: from: "2016-11-09T23:59:59.999Z"  }).toArray();
    require("internal").print(docs);
    

    Solution two:

    The second solution is to use the WAL change log that ArangoDB also exposes via its HTTP API. But this is much more complicated and requires keeping state on the client side.

    The basic idea is to query the WAL change log API at /_api/replication/logger-follow for the given collection. This API call can be given an initial tick value. This controls from where in the change log the request will start looking. In the beginning this tick value is unclear, so simply omit it. Using curl, the call for the collection test would be:

    curl -X GET "http://127.0.0.1:8529/_db/_system/_api/replication/logger-follow?collection=test" --basic --user "root:" --dump -
    

    All call to this API will produce some HTTP headers with state information and the WAL entries for the collection in chronological order, e.g.

    ...
    X-Arango-Replication-Checkmore: true
    X-Arango-Replication-Lastincluded: 6103060
    X-Arango-Replication-Lasttick: 6251758
    ...
    {"tick":"6101295","type":2000,"database":"1","cid":"6101294","cname":"test","data":"cid":"6101294","deleted":false,"doCompact":true,"indexBuckets":8,"isSystem":false,"isVolatile":false,"maximalSize":33554432,"name":"test","type":2,"version":5,"waitForSync":false}}
    {"tick":"6101298","type":2100,"database":"1","cid":"6101294","cname":"test","data":{"fields":["modified"],"id":"6101297","sparse":false,"type":"skiplist","unique":false}}
    {"tick":"6101302","type":2300,"tid":"0","database":"1","cid":"6101294","cname":"test","data":"_id":"test/6101300","_key":"6101300","_rev":"6101300","modified":"2015-06-26T14:18:30.732Z","value":0}}
    {"tick":"6101305","type":2300,"tid":"0","database":"1","cid":"6101294","cname":"test","data":"_id":"test/6101304","_key":"6101304","_rev":"6101304","modified":"2016-11-09T07:14:08.146Z","value":1}}
    {"tick":"6101308","type":2300,"tid":"0","database":"1","cid":"6101294","cname":"test","data":"_id":"test/6101307","_key":"6101307","_rev":"6101307","modified":"2015-05-14T04:45:01.202Z","value":2}}
    ...
    

    As can be seen the change log contains not only the insert/update operations for the documents but also the creation of the collection and the creation of the index. It will also contain all remove operations and other operations that change the meta-data of the collection.

    Using the change log results, you can now filter them on the client side for type 2300, which is a document insert or update operation, and then peek into data.modified of each returned document. You can then use the documents which satisfy your search condition.

    Note that the result of the request may not contain all operations, but it may contain only a fraction of them. It may be necessary to fetch more data from the server. This can be done by calling the API again, now using the value of the X-Arango-Replication-Lastincluded HTTP response header as tick value, e.g.

    curl -X GET "http://127.0.0.1:8529/_db/_system/_api/replication/logger-follow?collection=test&from=6103060" --basic --user "root:" --dump -
    

    This will produce even more operations. You can call the API again and again until it produces no more results and the value of the X-Arango-Replication-Checkmore HTTP response header becomes false. That means you have fetched all operations for the time being.

    This solution requires the client to potentially issue multiple HTTP requests and keep state (the last fetched tick value), so it's not as easy to use as the AQL-based solution.