AWS Glue for Mongo to Parquet file in S3


Can we use AWS Glue for the following?

  1. Extract data from MongoDB
  2. Convert to Parquet file and store the data in S3

Solution

  • Yes, this can be done by using "connectionType": "mongodb" as the source in your Glue ETL job; refer to the AWS Glue documentation on connection types and options for the syntax.

    That documentation also includes the example below, which reads data from MongoDB; the result can then be written to S3 in Parquet format.

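    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    # Inside a Glue job, GlueContext wraps the Spark context.
    glueContext = GlueContext(SparkContext.getOrCreate())

    # Replace the placeholder with the address of your MongoDB instance.
    mongo_uri = "mongodb://<mongo-instance-ip-address>:27017"

    read_mongo_options = {
        "uri": mongo_uri,
        "database": "test",
        "collection": "coll",
        "username": "username",
        "password": "pwd",
        "partitioner": "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB": "10",
        "partitionerOptions.partitionKey": "_id",
    }

    # Read the collection into a DynamicFrame.
    dynamic_frame = glueContext.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options=read_mongo_options,
    )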
    

    Once you have the data, apply any transformations you need and then write it to S3 with the statement below:

    glueContext.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={"path": "s3://glue-parquet/output-dir"},
        format="parquet",
    )
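
The read and write steps can be combined into one job script. The following is only a minimal sketch under a few assumptions: the helper names `build_mongo_read_options` and `run_job` are hypothetical (they are not part of the Glue API), the option keys are copied from the example in this answer, and the `awsglue` and `pyspark` modules are assumed to be available only inside the Glue job runtime, which is why they are imported inside `run_job`.

```python
def build_mongo_read_options(uri, database, collection, username, password,
                             partition_size_mb=10):
    """Hypothetical helper: build the MongoDB connection options dict.

    The keys mirror the ones used in the answer above; Glue expects the
    partition size as a string, so the number is converted here.
    """
    return {
        "uri": uri,
        "database": database,
        "collection": collection,
        "username": username,
        "password": password,
        "partitioner": "MongoSamplePartitioner",
        "partitionerOptions.partitionSizeMB": str(partition_size_mb),
        "partitionerOptions.partitionKey": "_id",
    }


def run_job(read_options, output_path):
    """Hypothetical helper: read from MongoDB, then write Parquet to S3.

    Meant to be called from within a Glue ETL job.
    """
    # Imported lazily because these modules exist only in the Glue runtime.
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glue_context = GlueContext(SparkContext.getOrCreate())

    # Step 1: extract the collection into a DynamicFrame.
    dynamic_frame = glue_context.create_dynamic_frame.from_options(
        connection_type="mongodb",
        connection_options=read_options,
    )

    # Apply any transformations you need here before writing.

    # Step 2: write the frame to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )
```

Inside the job, this could be called as `run_job(build_mongo_read_options(mongo_uri, "test", "coll", "username", "pwd"), "s3://glue-parquet/output-dir")`. Keeping the options in a plain function makes them easy to check outside Glue before the job is deployed.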