Tags: azure-data-factory, databricks, parquet, azure-data-lake

Specify parquet file name when saving in Databricks to Azure Data Lake


Is there a way to specify the name of a parquet file when I am saving it in Databricks to Azure Data Lake? For example, when I try to run the following statement:

append_df.write.mode('append').format('parquet').save('/mnt/adls/covid/base/Covid19_Cases')

a folder called Covid19_Cases gets created and it contains parquet files with random names.

What I would like to do is use the saved parquet file in a Data Factory copy activity. In order to do that, I need to specify the parquet file's name; otherwise I can't point to a specific file.


Solution

  • Since Spark runs in distributed mode and files, or their in-memory derivatives such as DataFrames, are processed in parallel, the processed data ends up as multiple part files inside the same folder. You can point the Data Factory copy activity at the folder instead of a single file. But if you really need a single, named file, you can use the approach below:

    # "year" is assumed to be defined earlier, e.g. year = "2021"
    save_location = "/mnt/adls/covid/base/Covid19_Cases" + year
    parquet_location = save_location + "/temp.folder"
    file_location = save_location + "/export.parquet"
    
    # repartition(1) forces Spark to write the data as a single part file
    # (the parquet writer takes no "header" option, so it is omitted)
    df.repartition(1).write.parquet(path=parquet_location, mode="append")
    
    # the part-*.parquet file typically sorts after the _SUCCESS/_committed_*
    # marker files, so [-1] picks it up
    file = dbutils.fs.ls(parquet_location)[-1].path
    dbutils.fs.cp(file, file_location)
    dbutils.fs.rm(parquet_location, recurse=True)
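
  • A note on the last step: relying on the dbutils.fs.ls() listing order to find the part file works because the _SUCCESS/_committed_* markers happen to sort before part-*.parquet, but filtering on the file suffix is more explicit. A minimal sketch of that variant, assuming the same df, year and mount path as above:

    # Assumed to match the paths used above
    save_location = "/mnt/adls/covid/base/Covid19_Cases" + year
    parquet_location = save_location + "/temp.folder"
    file_location = save_location + "/export.parquet"
    
    # Write a single part file, then pick it by its .parquet suffix
    # instead of by its position in the directory listing
    df.repartition(1).write.parquet(path=parquet_location, mode="append")
    part_file = [f.path for f in dbutils.fs.ls(parquet_location)
                 if f.name.endswith(".parquet")][0]
    
    # Copy it to the stable name the Data Factory dataset points to,
    # then drop the temporary folder
    dbutils.fs.cp(part_file, file_location)
    dbutils.fs.rm(parquet_location, recurse=True)

    Either way, the copy activity can then point at export.parquet, or simply at the folder if multiple part files are acceptable.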