aws-glue

Glue job writing Parquet files to S3 with incorrect datatypes


I am using a Glue job. The Glue job reads its input as a manifest file that points to JSON data files. After reading them into a DataFrame, we apply some processing/transformations, and then the Glue job writes the data to a different S3 location in Parquet format. But the datatype of the Parquet file columns is getting mapped to object instead of string.
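For reference, a minimal sketch of the read step described above (the bucket/prefix variables are placeholders, and the manifest handling is simplified to a direct path list):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

GLUE_CONTEXT = GlueContext(SparkContext.getOrCreate())

# Read the JSON data files into a DynamicFrame
source = GLUE_CONTEXT.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': [f's3://{source_s3_bucket}/{source_s3_path}']},
    format='json',
)
df = source.toDF()  # convert to a Spark DataFrame for the transformations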

This is the code I tried:

from pyspark.sql.functions import col

df = df.withColumn("my_column", col("my_column").cast("string"))
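Since write_dynamic_frame expects a DynamicFrame, the cast DataFrame is presumably converted back before the write (a sketch; transformed_with_contracts is the name used in the write call below):

from awsglue.dynamicframe import DynamicFrame

# Convert the Spark DataFrame back to a Glue DynamicFrame
transformed_with_contracts = DynamicFrame.fromDF(df, GLUE_CONTEXT, 'transformed_with_contracts')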
And I am using the below code to write it to S3:
write_to_s3 = GLUE_CONTEXT.write_dynamic_frame.from_options(
    frame=transformed_with_contracts,
    connection_type='s3',
    format='parquet',
    connection_options={
        'path': f's3://{destination_s3_bucket}/{destination_s3_path}'
    },
    format_options={},
    transformation_ctx='write_to_s3',
)

The datatype of the Parquet file columns is getting mapped to object instead of string, which is not what I expect. Any ideas?


Solution

  • From the above comments I can conclude that the issue is not with the Glue job; it is writing the Parquet files as expected.
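    To double-check this, you can inspect the physical Parquet schema directly with pyarrow, which bypasses pandas' dtype mapping entirely (a sketch; the file name is a placeholder for a local copy of one output file):

    import pyarrow.parquet as pq

    # Read only the schema of one output file (downloaded locally here)
    schema = pq.read_schema("part-00000.parquet")
    print(schema)  # a column cast to string shows up as 'string' in the Parquet schema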

    In pandas, since string data has variable length, string columns are stored as the object dtype by default; this happens on the read side, not in the Parquet file. Please see the workaround below.
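    A minimal, self-contained illustration of that pandas default (the column name is just an example):

    import pandas as pd

    df = pd.DataFrame({"my_column": ["a", "b", "c"]})
    print(df.dtypes)                   # my_column    object  (pandas default for strings)
    print(df.convert_dtypes().dtypes)  # my_column    string  (nullable extension dtype)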

    I tested it with the awswrangler module, which reads the column back as string rather than object:

    import awswrangler as wr

    # Read the Parquet output back from S3
    df = wr.s3.read_parquet(path="s3://<bucket>/<path>/<file>.parquet")
    # Optionally write it back to S3 with awswrangler as well
    wr.s3.to_parquet(df=df, path="s3://<bucket>/<path>/<file>.parquet")
    df.info()  # string columns now report dtype 'string' instead of 'object'
    

    Try using this instead!