Tags: mongodb, apache-spark, apache-spark-dataset

Spark read from MongoDB and filter by ObjectId indexed field


I'm trying to read a dataset from MongoDB using mongo-spark-connector 2.2.0 with a filter on the _id field.

For example:

    MongoSpark.loadAndInferSchema(session, ReadConfig.create(session))
        .filter(col("_id").getItem("oid").equalTo("590755cd7b868345d6da1f40"));

This query takes a very long time on a big collection. It looks like the query doesn't use the default _id index that I have on the collection, because the filter compares against a string instead of an ObjectId. How can I make it use the index?
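One way to check whether the predicate actually reached MongoDB is to print the query plan; filters pushed down to the data source show up in the scan node. A minimal sketch in Scala, reusing the `session`, collection, and ObjectId hex string from the question:

    import com.mongodb.spark.MongoSpark
    import com.mongodb.spark.config.ReadConfig
    import org.apache.spark.sql.functions.col

    // Print the logical and physical plans; filters pushed to the MongoDB
    // relation appear in the scan node, the rest run in Spark after the load.
    val ds = MongoSpark.loadAndInferSchema(session, ReadConfig.create(session))
    ds.filter(col("_id").getItem("oid").equalTo("590755cd7b868345d6da1f40")).explain(true)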


Solution

  • The Mongo Spark Connector should push predicates down to MongoDB by default, so the default _id index can be used. If that is not happening, the pipeline API achieves the same result; see the example below.

    import com.mongodb.spark.MongoSpark
    import org.bson.Document

    val rdd = MongoSpark.load(sc)
    // $oid makes Document.parse build a real ObjectId, so the $match hits the default _id index
    val filterRdd = rdd.withPipeline(Seq(Document.parse("""{ $match : { _id : { $oid : "590755cd7b868345d6da1f40" } } }""")))
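
If embedding extended JSON in a string feels brittle, the same $match stage can be built programmatically. A sketch assuming the driver's org.bson.types.ObjectId class (bundled with the connector) is on the classpath:

    import org.bson.Document
    import org.bson.types.ObjectId

    // Build the $match stage with a typed ObjectId so the comparison
    // matches the stored _id type and the default index applies.
    val matchStage = new Document("$match", new Document("_id", new ObjectId("590755cd7b868345d6da1f40")))
    val filterRdd = rdd.withPipeline(Seq(matchStage))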