Search code examples
pythonpysparkaws-glueaws-documentdb

AWS Glue 4.0 failing when calling DynamicFrame.fromDF


I'm trying to convert a Spark data frame in Python 3.10 into a dynamic frame using Glue's fromDF method
from awsglue.dynamicframe import DynamicFrame
DynamicFrame.fromDF(frame, glue_context, "node")

But this throws an error on CloudWatch saying com.mongodb.spark.sql.connector.exceptions.MongoSparkException: Partitioning failed.

Inspecting the logs further, I found that under the hood Glue seems to be using $collStats somewhere, which is not supported by AWS DocumentDB. Note that this function worked when the job was Glue 2.0, but updating it to 4.0 has started causing this issue.
com.mongodb.MongoCommandException: Command failed with error 304: 'Aggregation stage not supported: '$collStats''

I haven't really tried fixing this issue, because this seems to be happening under the hood, and I don't have access to the source code of Glue 4.0. The only thing is that this did not fail in Glue 2.0.


Solution

  • What worked for me is setting the partitioner in the batch read configuration for mongo-spark to be set to SinglePartitionPartitioner. Seems like the default was updated to use SamplePartitioner in Glue 4.0 and the newer mongo-spark connector versions which was causing the problem. Refer the documentation