I use Spark with the mongo-hadoop connector to sync data from a MongoDB collection to an HDFS file. My code works fine when the collection is read through mongos, but when it comes to local.oplog.rs, a replica-set collection that can only be read through mongod, it throws this exception:
Caused by: com.mongodb.hadoop.splitter.SplitFailedException: Unable to calculate input splits: couldn't find index over splitting key { _id: 1 }
I think the data structure is different between oplog.rs and a normal collection: oplog.rs doesn't have an "_id" property, so newAPIHadoopRDD cannot work normally. Is that right?
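For context, the read is set up roughly like this (a minimal sketch, not my exact job; host names and paths are placeholders):

import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.bson.BSONObject
import com.mongodb.hadoop.MongoInputFormat

val sc = new SparkContext(new SparkConf().setAppName("mongo-to-hdfs"))

val mongoConfig = new Configuration()
// This works: a collection read through mongos.
mongoConfig.set("mongo.input.uri", "mongodb://mongos-host:27017/urDB.urCollection")
// This fails with SplitFailedException: the oplog read through a mongod member.
// mongoConfig.set("mongo.input.uri", "mongodb://mongod-host:27017/local.oplog.rs")

val rdd = sc.newAPIHadoopRDD(
  mongoConfig,
  classOf[MongoInputFormat],
  classOf[Object],
  classOf[BSONObject])

rdd.values.map(_.toString).saveAsTextFile("hdfs:///tmp/urCollection")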
Yes, the document structure is a bit different in oplog.rs. Also, local.oplog.rs is a capped collection with no indexes (not even the default _id index), which is why the splitter cannot find an index over { _id: 1 }. You will find your actual document in the "o" field of each oplog entry.
Example oplog document:
{
    "ts" : Timestamp(1459500301, 6436),
    "h" : NumberLong("5511242317261841397"),
    "v" : 2,
    "op" : "i",
    "ns" : "urDB.urCollection",
    "o" : {
        "_id" : ObjectId("567ba035e4b01052437cbb27"),
        ....
        .... this is your original document.
    }
}
Use "ns" and "o" of oplog.rs to get your expected collection and document.