Search code examples
mongodbindexingtail

MongoDB Oplog Cursor on Sharded Cluster with Compound Index


Having a OpLog cursor, is it possible to get another index in the update operation other than the default _id one?

Background:

I have a sharded cluster in place with a compound index as shard key. One part of this compound key is used to determine, which set of shards is used to store the data (also known as Tag Aware Sharding)

There are some NodeJS Microservices running in the background tailing the OpLogs of the ReplicaSets of the different shards to trigger further processing on data changes. Now if some data gets updated, the only index returned in the OpLog is the default _id, which forces me to query the whole cluster for the second part of the compound index to leverage the whole shard key in further processing.

The application is very write intensive and means one extra query over the whole cluster for each update. I could avoid this query if I could get the whole compound index in the update operation.

Thanks for any input!


Solution

  • As at MongoDB 3.2, the replication oplog does not include details of the shard key or secondary indexes relating to a document. The oplog wasn't designed for your use case; I would suggest watching/upvoting SERVER-13932: Change Notification Stream API in the MongoDB issue tracker.

    Now if some data gets updated, the only index returned in the OpLog is the default _id, which forces me to query the whole cluster for the second part of the compound index to leverage the whole shard key in further processing.

    There are some NodeJS Microservices running in the background tailing the OpLogs of the ReplicaSets of the different shards to trigger further processing on data changes. Now if some data gets updated, the only index returned in the OpLog is the default _id, which forces me to query the whole cluster for the second part of the compound index to leverage the whole shard key in further processing.

    With a sharded cluster you have to tail the oplog on each shard, as you are doing. However, there is a useful property of _id and shard keys for your use case: both are immutable.

    I'm not sure how your microservices are configured to aggregate updates, but if you see an insert or update and you want to find out more information about the shard key for that document you only need to query one shard: the one you just observed updating that document.

    So a suggested approach to try is:

    • oplog tailing on shard discovers _id of updated document of interest
    • query local shard for document (by _id) to find the shard key fields
    • read/update document for further processing via mongos using the shard key

    You should test to see if this actually makes a measurable performance difference for your deployment, but this approach would enable queries targeted to a single shard rather than scatter/gather to all shards.

    Obvious caveat: Aside from cheating to discover the shard key by fetching the document from the local shard where you observed an update in the oplog, you definitely want all queries & updates for your sharded cluster to be processed via mongos.