I am using a Glue job. The job reads its input from a manifest file that lists JSON data files. After reading them into a DataFrame, we apply some processing/transformations, and then the Glue job writes the data to a different S3 location in Parquet format. But the datatype of the Parquet file columns is getting mapped to object and not string.
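For reference, the read side of the job looks roughly like this (a sketch: the manifest parsing is simplified to a placeholder list of JSON paths, and the variable names are illustrative):

from awsglue.context import GlueContext
from pyspark.context import SparkContext

GLUE_CONTEXT = GlueContext(SparkContext.getOrCreate())

# Placeholder for the JSON data files listed in the manifest
json_paths = ["s3://<bucket>/<prefix>/data-0001.json", "s3://<bucket>/<prefix>/data-0002.json"]

# Read the JSON data files into a DynamicFrame, then convert to a Spark DataFrame
dyf = GLUE_CONTEXT.create_dynamic_frame.from_options(
    connection_type='s3',
    connection_options={'paths': json_paths},
    format='json',
)
df = dyf.toDF()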
This is the code I tried:
from pyspark.sql.functions import col
df = df.withColumn("my_column", col("my_column").cast("string"))
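To double-check the Spark side, the schema can be printed right after the cast; it should show the column as string:

# Expected to include: |-- my_column: string (nullable = true)
df.printSchema()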
And I am using the below code to write it to S3:
write_to_s3 = GLUE_CONTEXT.write_dynamic_frame.from_options(
    frame=transformed_with_contracts,
    connection_type='s3',
    format='parquet',
    connection_options={
        'path': f's3://{destination_s3_bucket}/{destination_s3_path}'
    },
    format_options={},
    transformation_ctx='write_to_s3',
)
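For completeness, transformed_with_contracts is the DynamicFrame built from the cast DataFrame; the conversion step (not shown in my snippet above) is roughly:

from awsglue.dynamicframe import DynamicFrame

# Convert the cast Spark DataFrame back to a DynamicFrame for the Glue writer
transformed_with_contracts = DynamicFrame.fromDF(df, GLUE_CONTEXT, 'transformed_with_contracts')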
The datatype of the Parquet file columns is still getting mapped to object and not string, which is not what I expect. Any ideas?
From the comments above, I can conclude that the issue is not with the Glue job; it is writing the Parquet files as expected.
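You can confirm this by inspecting the Parquet schema directly, for example with pyarrow on a locally downloaded copy of one of the output files (the path below is a placeholder):

import pyarrow.parquet as pq

# The physical Parquet schema shows the column as string,
# even though pandas displays the dtype as object
print(pq.read_schema("/tmp/<file>.parquet"))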
In pandas, string data has variable length, so it is stored as the object dtype by default; this is how pandas represents strings, not a problem with the Parquet file itself. Please see the workaround below.
I tested this with the awswrangler module, which treats the column as string rather than object:
import awswrangler as wr
# Read the Parquet file that the Glue job wrote to S3
df = wr.s3.read_parquet(path="s3://<bucket>/<path>/<file>.parquet")
# Write it back, then check the dtypes - the column is treated as string rather than object
df.to_parquet("s3://<bucket>/<path>/<file>.parquet")
df.info()
Try using this instead!
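If you want to stay within awswrangler for the write step as well (instead of having pandas write directly to s3://, which needs s3fs installed), the equivalent call is wr.s3.to_parquet:

import awswrangler as wr

# Write the DataFrame back to S3 as Parquet via awswrangler
wr.s3.to_parquet(df=df, path="s3://<bucket>/<path>/<file>.parquet")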