I'm adding a column with the timestamp the Glue job was run at, and I save the output using partitionBy("load_timestamp"). A folder gets created named e.g. load_timestamp=2020-04-27 03:21:54, but I want it to be named table_name=2020-04-27 03:21:54 instead.
Is this possible?
from pyspark.sql.functions import lit, unix_timestamp

enriched = df.withColumn("load_timestamp", unix_timestamp(lit(timestamp), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
enriched.write.partitionBy("load_timestamp").format("parquet").mode("append").save("s3://s3-enriched-bucket/" + job_statement[0])
By default, Spark names partition directories after the partition column, i.e. <partition_column_name>=<value>.
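For example, the write above produces a layout along these lines (the part-file name shown is illustrative):

s3://s3-enriched-bucket/<job_statement[0]>/load_timestamp=2020-04-27 03:21:54/part-00000-<uuid>.snappy.parquet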
The easiest fix is to name the column table_name and use that in the partitionBy clause:
enriched = df.withColumn("table_name", unix_timestamp(lit(timestamp), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
enriched.write.partitionBy("table_name").format("parquet").mode("append").save("s3://s3-enriched-bucket/" + job_statement[0])
The other way would be to rename the directories after the write by iterating over them with the Hadoop FileSystem API, changing the load_timestamp= prefix to table_name=.
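A minimal sketch of that approach through PySpark's JVM gateway, assuming an active SparkSession named spark and the bucket/prefix from the snippets above (not a tested job):

# Sketch only: rename every load_timestamp=<value> directory to table_name=<value>.
sc = spark.sparkContext
Path = sc._jvm.org.apache.hadoop.fs.Path

base = Path("s3://s3-enriched-bucket/" + job_statement[0])
fs = base.getFileSystem(sc._jsc.hadoopConfiguration())

for status in fs.listStatus(base):
    name = status.getPath().getName()
    if name.startswith("load_timestamp="):
        fs.rename(status.getPath(), Path(base, name.replace("load_timestamp=", "table_name=", 1)))

Keep in mind that a rename on S3 is a copy plus delete rather than an atomic metadata operation, so for large tables writing with the desired column name in the first place is the cheaper option.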