Tags: python, parquet, azure-data-lake, azure-databricks, azure-data-lake-gen2

Azure Databricks - Write Parquet file to Curated Zone


Writing a Parquet file back to Data Lake Gen2 creates additional files.

Example:

%python
rawfile = "wasbs://[email protected]/xxxx/2019-09-30/account.parquet"
curatedfile = "wasbs://[email protected]/xxxx-Curated/2019-09-30/account.parquet"
dfraw = spark.read.parquet(rawfile)
dfraw.write.parquet(curatedfile, mode = "overwrite")
display(dfraw)


The file name supplied (account.parquet) is used as the name of a created folder, rather than a single file being written with that name.

How can these additional files be avoided and the file written with the supplied name?


Solution

  • When a user writes a file in a job, DBIO performs the following actions for you:

    • Tag written files with the unique transaction id.
    • Write files directly to their final location.
    • Mark the transaction as committed when the job commits.

    It's not possible to change the output file name directly in Spark's save.

    Spark uses the Hadoop file output format, which requires data to be written in partitions - that's why you get part-* files. You can rename the file after processing, as in the SO thread and the sketch below.
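
    A minimal sketch of that rename approach, assuming a Databricks notebook where dbutils is available and reusing dfraw from the question; the storage paths below are illustrative placeholders, not the question's actual account:

    # Write to a temporary folder, then move the single part-* file to the desired name.
    # Placeholder paths - replace <container>/<account> with real values.
    temp_dir = "wasbs://<container>@<account>.blob.core.windows.net/xxxx-Curated/2019-09-30/_tmp_account"
    target = "wasbs://<container>@<account>.blob.core.windows.net/xxxx-Curated/2019-09-30/account.parquet"

    # coalesce(1) forces Spark to produce a single part-* file (only sensible for small data)
    dfraw.coalesce(1).write.parquet(temp_dir, mode="overwrite")

    # locate the single part-* file inside the temporary folder
    part_file = [f.path for f in dbutils.fs.ls(temp_dir) if f.name.startswith("part-")][0]

    # copy it to the desired name, then remove the temporary folder
    dbutils.fs.cp(part_file, target)
    dbutils.fs.rm(temp_dir, True)

    Note that coalescing to a single file gives up Spark's parallel write, so this is only appropriate when the output is small enough to fit comfortably in one file.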

    You may also refer to a similar SO thread that addressed the same issue.

    Hope this helps.