Tags: azure-data-factory, databricks, parquet, azure-data-lake

Specify parquet file name when saving in Databricks to Azure Data Lake


Is there a way to specify the name of a parquet file when I am saving it in Databricks to Azure Data Lake? For example, when I try to run the following statement:

append_df.write.mode('append').format('parquet').save('/mnt/adls/covid/base/Covid19_Cases')

a folder called Covid19_Cases gets created and it contains parquet files with random names.

What I would like to do is use the saved parquet file in a Data Factory copy activity. In order to do that, I need to specify the parquet file's name; otherwise I can't point to a specific file.


Solution

  • Since Spark runs in distributed mode and files, or their in-memory derivatives such as DataFrames, are processed in parallel, the processed data ends up as multiple part files inside the same folder. You can point the Data Factory copy activity at the folder instead of a single file. But if you really need a single, named file, you can use the approach below:

    # "year" is assumed to be defined earlier, e.g. year = "2021"
    save_location = "/mnt/adls/covid/base/Covid19_Cases" + year
    parquet_location = save_location + "/temp.folder"
    file_location = save_location + "/export.parquet"
    
    # repartition(1) forces Spark to write the data as a single part file
    # (the parquet writer takes no "header" option, so it is omitted)
    df.repartition(1).write.parquet(path=parquet_location, mode="append")
    
    # the part-*.parquet file typically sorts after the _SUCCESS/_committed_*
    # marker files, so [-1] picks it up
    file = dbutils.fs.ls(parquet_location)[-1].path
    dbutils.fs.cp(file, file_location)
    dbutils.fs.rm(parquet_location, recurse=True)
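
  • A note on the last step: relying on the dbutils.fs.ls() listing order to find the part file works because the _SUCCESS/_committed_* markers happen to sort before part-*.parquet, but filtering on the file suffix is more explicit. A minimal sketch of that variant, assuming the same df, year and mount path as above:

    # Assumed to match the paths used above
    save_location = "/mnt/adls/covid/base/Covid19_Cases" + year
    parquet_location = save_location + "/temp.folder"
    file_location = save_location + "/export.parquet"
    
    # Write a single part file, then pick it by its .parquet suffix
    # instead of by its position in the directory listing
    df.repartition(1).write.parquet(path=parquet_location, mode="append")
    part_file = [f.path for f in dbutils.fs.ls(parquet_location)
                 if f.name.endswith(".parquet")][0]
    
    # Copy it to the stable name the Data Factory dataset points to,
    # then drop the temporary folder
    dbutils.fs.cp(part_file, file_location)
    dbutils.fs.rm(parquet_location, recurse=True)

    Either way, the copy activity can then point at export.parquet, or simply at the folder if multiple part files are acceptable.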