mongodb, spark-streaming, mongodb-oplog

Is each document in MongoDB's local.oplog.rs a standard BSON object structure?


I use the Spark mongo-connector to sync data from a MongoDB collection to an HDFS file. My code works fine when the collection is read through mongos, but when it comes to local.oplog.rs, a replica-set collection that can only be read through mongod, it gives me this exception:

Caused by: com.mongodb.hadoop.splitter.SplitFailedException: Unable to calculate input splits: couldn't find index over splitting key { _id: 1 }

I think the data structure is different between oplog.rs and a normal collection: oplog.rs has no index on "_id", so newAPIHadoopRDD cannot work normally. Is that right?


Solution

  • Yes, the document structure is a bit different in oplog.rs. You will find your actual document in the "o" field of each oplog entry.

    Example oplog document:

    {
        "_id" : ObjectId("586e74b70dec07dc3e901d5f"),
        "ts" : Timestamp(1459500301, 6436),
        "h" : NumberLong("5511242317261841397"),
        "v" : 2,
        "op" : "i",
        "ns" : "urDB.urCollection",
        "o" : {
            "_id" : ObjectId("567ba035e4b01052437cbb27"),
            ....  this is your original document
        }
    }

    Use the "ns" and "o" fields of oplog.rs to get your expected collection and document.
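
    To illustrate, here is a minimal Python sketch (not part of the mongo-connector; the helper name and the sample entries are made up for illustration) that filters oplog-shaped documents by "ns" and unwraps the original document from "o":

    ```python
    # Hypothetical helper: given oplog-shaped entries, return the embedded
    # documents ("o" field) for insert operations on one namespace.
    def extract_docs(oplog_entries, namespace):
        docs = []
        for entry in oplog_entries:
            # "op" is the operation type: "i" = insert, "u" = update, "d" = delete
            if entry.get("ns") == namespace and entry.get("op") == "i":
                docs.append(entry["o"])
        return docs

    # Example oplog-shaped entries (ObjectId/Timestamp values elided for brevity)
    entries = [
        {"op": "i", "ns": "urDB.urCollection", "o": {"_id": 1, "name": "a"}},
        {"op": "i", "ns": "urDB.other", "o": {"_id": 2}},
        {"op": "d", "ns": "urDB.urCollection", "o": {"_id": 1}},
    ]
    print(extract_docs(entries, "urDB.urCollection"))  # → [{'_id': 1, 'name': 'a'}]
    ```

    The same idea applies inside a Spark job: after reading the raw oplog entries, map each entry to its "o" payload, filtering on "ns" (and "op") first.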