Tags: mongodb, apache-spark, apache-spark-sql, spark-shell

Spark SQL and MongoDB query execution times on the same data don't produce expected results


This is a general question, but I hope someone can answer it. I am comparing query execution times between MongoDB and Spark SQL. Specifically, I created a MongoDB collection of 1 million entries from a .csv file and ran a few queries on it using mongosh in Compass. Then, using the Spark Shell and the MongoDB Spark Connector, I loaded the collection from MongoDB into Spark as an RDD, converted the RDD into a DataFrame, and started running Spark SQL queries on it. I ran the same queries in Spark SQL as in MongoDB and measured the query execution time in both cases. The result was that on fairly simple queries like

SELECT ... FROM ... WHERE ... ORDER BY ...

MongoDB was significantly faster than Spark SQL. In one of those examples, MongoDB took around 800 ms while Spark SQL took around 1800 ms.

From my understanding, a Spark DataFrame automatically distributes the computation and runs it in parallel, so the Spark SQL query should be faster than the MongoDB query. Can anyone explain this?


Solution

  • I was right: Spark SQL should be faster on the DataFrame than the equivalent MongoDB queries. It appears that importing the data through the connector is what causes the execution time problem. When I loaded the data from the .csv file straight into a DataFrame, the query time was much faster than both the query on the connector-imported data and the query in MongoDB.