pyspark, aws-glue

Rename the folder created as a result of partitionBy


I'm adding a column for the timestamp at which the Glue job was run, and I want to partition the output using partitionBy("load_timestamp"). This creates a folder such as load_timestamp=2020-04-27 03:21:54, but I want the folder to be named table_name=2020-04-27 03:21:54 instead. Is this possible?

enriched = df.withColumn("load_timestamp", unix_timestamp(lit(timestamp),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
enriched.write.partitionBy("load_timestamp").format("parquet").mode("append").save("s3://s3-enriched-bucket/" + job_statement[0])

Solution

  • By default Spark names partition directories after the partition column, i.e.

    <partition_column_name>=<value>

    The easiest fix is to name the column table_name and use that column in the partitionBy clause.

    enriched = df.withColumn("table_name", unix_timestamp(lit(timestamp),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))
    
    enriched.write.partitionBy("table_name").format("parquet").mode("append").save("s3://s3-enriched-bucket/" + job_statement[0])
    

    The other way would be:

    Renaming the directories after the write by iterating over them with the Hadoop FileSystem API, changing the load_timestamp= prefix to table_name= (see the sketch below).
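
    A minimal sketch of that rename approach, reaching the Hadoop FileSystem API through Spark's JVM gateway. The output_path value stands in for the question's "s3://s3-enriched-bucket/" + job_statement[0] and is a hypothetical placeholder; the directory-name matching assumes the layout produced by the partitioned write above.

    # Sketch: rename load_timestamp=<value> directories to table_name=<value>
    # after the partitioned write. output_path is a hypothetical placeholder
    # for the real output location.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Access the Hadoop FileSystem API via Spark's JVM gateway.
    hadoop_conf = spark._jsc.hadoopConfiguration()
    Path = spark._jvm.org.apache.hadoop.fs.Path

    output_path = Path("s3://s3-enriched-bucket/my_table")  # hypothetical
    fs = output_path.getFileSystem(hadoop_conf)

    # Iterate over the partition directories and rename each one.
    for status in fs.listStatus(output_path):
        if status.isDirectory():
            name = status.getPath().getName()
            if name.startswith("load_timestamp="):
                new_name = name.replace("load_timestamp=", "table_name=", 1)
                fs.rename(status.getPath(), Path(output_path, new_name))

    Note that on S3 a rename is not a cheap metadata operation: the S3 connectors implement it as a copy followed by a delete, so renaming after the write adds extra I/O compared with simply partitioning on a table_name column in the first place.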