amazon-web-services dataframe pyspark aws-glue aws-glue-spark

Glue job schema inference issue


Requirement: I need a Glue job to export DynamoDB data (a nested structure combining maps and lists) to S3.

My approach: First, I used a Glue DynamicFrame to read all the data from DynamoDB into a single DynamicFrame.

datasource = glueContext.create_dynamic_frame.from_options(
    "dynamodb",
    connection_options={
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": read_percentage,
        "dynamodb.splits": "100",
    },
)

After this, the datasource DynamicFrame contains all the data.

I want to apply some transformations and filters, so I converted it to a PySpark DataFrame.

df0 = datasource.toDF()

My input DataFrame df0 holds the JSON data in a struct column named collection. I need a JSON string rather than a struct, so I used to_json to convert it.

from pyspark.sql.functions import to_json
df1 = df0.select(to_json("collection"))

From df1, I access whatever fields I need.

Major Issue

Some of the attributes in collection appear like this:

collection : {
    "name" : "aaa",
    "id" : "111",
    "address" : "some address",
    "price" : {"string" : 1212.0},
    "retailer" : {"string" : "xxxx"},
    "categories" : {"array" : ["7216"]}
}

In the example above, price, retailer, and categories each appear as a nested attribute keyed by its data type.

I want the output to look like this:

collection : {
    "name" : "aaa",
    "id" : "111",
    "address" : "some address",
    "price" : "1212.0",
    "retailer" : "xxxx",
    "categories" : "[7216]"
}

How can I resolve this? Please let me know.


Solution

  • What you are seeing is expected behaviour. When a column has ambiguous types within a DynamicFrame, that is, the same attribute holds different types across records, Glue keeps it as a choice type and lets you choose which data type you want.

    ResolveChoice resolves ambiguous types within a DynamicFrame and offers several options: cast, project, make_cols, and make_struct.

    Depending on your requirement, choose one of these options to resolve the issue.
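
As a rough illustration of the cast option, the sketch below wraps the DynamicFrame method resolveChoice in a small helper. The helper name resolve_collection_choices is hypothetical, and the nested field paths "collection.price" and "collection.retailer" are assumptions based on the sample in the question, so adjust them to the actual schema.

```python
def resolve_collection_choices(dynamic_frame):
    """Cast the ambiguous fields of a Glue DynamicFrame to string.

    Assumption: the ambiguous fields live at "collection.price" and
    "collection.retailer". Change the paths to match your schema.
    """
    # Each spec is a (field_path, action) tuple. "cast:string" asks Glue
    # to cast every value of that field to the string type.
    specs = [
        ("collection.price", "cast:string"),
        ("collection.retailer", "cast:string"),
    ]
    return dynamic_frame.resolveChoice(specs=specs)
```

In the job this would be called on the DynamicFrame before the conversion to a DataFrame, for example resolved = resolve_collection_choices(datasource) followed by df0 = resolved.toDF(). The categories field is left out here because it is an array; it may need a different action, or a conversion on the DataFrame side after toDF().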