Writing a Parquet file back to Data Lake Gen2 creates additional files alongside the output.
Example:
%python
# Read the raw Parquet data and write it back out to the curated zone.
rawfile = "wasbs://xxxx@dxxxx.blob.core.windows.net/xxxx/2019-09-30/account.parquet"
curatedfile = "wasbs://xxxx@xxxx.blob.core.windows.net/xxxx-Curated/2019-09-30/account.parquet"
dfraw = spark.read.parquet(rawfile)
dfraw.write.parquet(curatedfile, mode="overwrite")
display(dfraw)
The file name supplied (account.parquet) ends up being used as the name of a folder rather than as the name of a single output file.
How can these additional files be avoided and the output written as one file with the supplied name?
When a job writes output on Databricks, the DBIO transactional-commit protocol writes extra metadata files (such as _SUCCESS and the _started_<id> / _committed_<id> markers) alongside the data files, which is where the additional files come from.
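As a quick check, you can list the output path to see these files. This is a minimal sketch, assuming a Databricks notebook where dbutils is available; the path reuses the placeholder from the question.

%python
# List everything Spark wrote under the output "file" (actually a folder).
for f in dbutils.fs.ls("wasbs://xxxx@xxxx.blob.core.windows.net/xxxx-Curated/2019-09-30/account.parquet"):
    print(f.name)
# Expect part-* files holding the data, plus markers such as _SUCCESS and,
# on Databricks, _started_<id> / _committed_<id> from the commit protocol.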
It is not possible to change the output file name directly in Spark's save; Spark always writes to a directory.
Spark writes through the Hadoop file format, which requires the data to be partitioned; that is why you get part- files. You can rename the output after processing, as in the SO thread referenced below.
You may refer to a similar SO thread, which addresses the same issue.
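For example, a common workaround is to collapse the output to a single part- file and then move it to the desired name. This is a minimal sketch, assuming a Databricks notebook where dbutils is available; the paths reuse the placeholders from the question.

%python
tmpdir = "wasbs://xxxx@xxxx.blob.core.windows.net/xxxx-Curated/2019-09-30/_tmp"
target = "wasbs://xxxx@xxxx.blob.core.windows.net/xxxx-Curated/2019-09-30/account.parquet"

# coalesce(1) funnels all rows through one task so only one part- file is
# produced; use it only when the data fits comfortably on a single executor.
dfraw.coalesce(1).write.parquet(tmpdir, mode="overwrite")

# Locate the lone part- file and move it to the supplied name.
partfile = [f.path for f in dbutils.fs.ls(tmpdir) if f.name.startswith("part-")][0]
dbutils.fs.mv(partfile, target)

# Clean up the temporary folder and its leftover metadata files.
dbutils.fs.rm(tmpdir, recurse=True)

Note that account.parquet is then a single file rather than a folder, so downstream readers should point at that file directly.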
Hope this helps.