Search code examples
mongodbaggregateparquetamazon-athenapresto

Convert Array to Json in Mongodb Aggregate


I have a Mongo Document in below format:

{
    "id":"eafa3720-28e2-11ed-bf07"
    "type":"test"
    "serviceType_details": [
        {
            "is_custom_service_type": false,
            "bill_amount": 100
        }
    ]
}

"serviceType_details" Key doesn't have any definite schema.

Now I want to export it using MongoDB aggregate to Parquet so that I could use Presto to query it.

My Pipeline Code:

db.test_collection.aggregate([
    {
        $match: {
            "id": "something"
        }
    },
    {
        $addFields: {
            ...
        },
    }
    {
        "$out" : {
            "format" : {
                "name" : "parquet",
                "maxFileSize" : "10GB",
                "maxRowGroupSize" : "100MB"
            }
        }
    }
])

Now I want to export the value of "serviceType_details" in json string not as array ( when using current code parquet recognises it as an array)

I have tried $convert,$project and it's not working.

Currently the generated Parquet schema looks something like this:

enter image description here

I want the generated Parquet schema for "serviceType_details" to have as string and value should be stringify version of array which is present in mongo document.

Reason for me to have need it as string is because in each document "serviceType_details" details have completely different schema, its very difficult to maintain Athena table on top of it.


Solution

  • You can use the $function operator to define custom functions to implement behaviour not supported by the MongoDB Query Language

    It could be done using "$function" like this:

        db.test_collection.aggregate([
        {
            $match: {
                "id": "something"
            }
        },
        {
            $addFields: {
                newFieldName: {
                    $function: {
                        body: function(field) {
                            return (field != undefined && field != null) ? JSON.stringify(field) : "[]"
                        },
                        args: ["$field"],
                            lang: "js"
                    }
                },
            },
        }
        {
            "$out" : {
                "format" : {
                    "name" : "parquet",
                    "maxFileSize" : "10GB",
                    "maxRowGroupSize" : "100MB"
                }
            }
        }
    ])
    

    Executing JavaScript inside an aggregation expression may decrease performance. Only use the $function operator if the provided pipeline operators cannot fulfill your application's needs.