
Saving a PySpark dataframe from Azure Databricks to Azure Data Lake in a dynamically created folder


I am doing an ETL process in Azure.

1. The source data is in Azure Data Lake.
2. It is processed in Azure Databricks.
3. The output dataframe is loaded into Azure Data Lake, into a specific folder
   built from the current year / month / date, and then written as a file in CSV format.

I am stuck at the 3rd step:

1. Tried loading the dataframe to the mnt location:

   outPath = '/dbfs/mnt/abcd/<outputfoldername>/' + cy_year + "/" + cy_month + "/" + cy_date + "/"

   df.coalesce(1).write.mode("overwrite").format("com.databricks.spark.csv").option("header", "true").csv(outPath)

This saves the data to DBFS but not to ADLS, despite what many links on the internet suggest.

2. Tried setting the storage account key directly:

   spark.conf.set("fs.azure.account.key.<storage account name>.dfs.core.windows.net", "<ACCESS KEY>")

   output_container_path = "abfss://<container-name>@salesdetails.dfs.core.windows.net/<dir path>"

   df.coalesce(1).write.mode("overwrite").option("header", "true").format("com.databricks.spark.csv").save(output_container_path)

This saves the data into ADLS, but as 4 files: 3 of them are supporting metadata files and only one is the actual CSV part file. I want a single final file with a chosen name, for example abc.csv.

3. Tried with a pandas dataframe, which gives the flexibility to name the
   output file, but it needs a fixed folder path, whereas my folder path is
   created dynamically.

Please assist at the earliest. Many thanks in advance


Solution

  • Follow these steps:

    1. Mount your Azure Data Lake Storage Gen2 storage account as per the MS_Doc.

    configs = {"fs.azure.account.auth.type": "OAuth",
              "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
              "fs.azure.account.oauth2.client.id": "f4dab6c8-5009-4857xxxxxxxxxxxxx",
              "fs.azure.account.oauth2.client.secret":"3GF8Q~3ZGkgflxxxxxxxxx",
              "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/72f988bfxxxxxxx/oauth2/token"}
    
    dbutils.fs.mount(
      source = "abfss://<container>@<storage_account>.dfs.core.windows.net/",
      mount_point = "/mnt/abcd11",
      extra_configs = configs)
    

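    Once the mount succeeds, a quick sanity check (a minimal sketch, assuming the /mnt/abcd11 mount point from the snippet above) is to list the mounted container:

    # list the root of the mounted container to confirm the mount is usable
    display(dbutils.fs.ls("/mnt/abcd11"))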

    2. Configure your storage account and read the source data from Azure Data Lake.

    df11 = spark.read.format("csv").load("abfss://<container>@<storage_account>.dfs.core.windows.net/")
    
    display(df11)
    


    Sample_Code:

    from datetime import datetime

    now = datetime.now()
    year = now.year
    month = now.month
    day = now.day

    folder12 = "/mnt/abcd11/{}/{}/{}/output.csv".format(year, month, day)

    # write the dataframe into the dynamically built folder in CSV format
    # (Spark creates a directory named output.csv containing part files)
    df11.write.option("header", "true").csv(folder12, mode="overwrite")
     
    


    Output: the CSV output is written under the /mnt/abcd11/<year>/<month>/<day>/ folder in the mounted container.
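    Note on the single-file requirement from the question: Spark always writes a folder of part files, so the write above still produces a directory. A common workaround (a hedged sketch, not part of the steps above; the abc.csv name, the tmp_output folder, and the /mnt/abcd11 mount point are assumptions) is to coalesce to one partition, copy the lone part file to the desired name with dbutils.fs, and then delete the temporary folder:

    from datetime import datetime

    now = datetime.now()
    # temporary folder Spark writes into, and the final single-file path (illustrative names)
    tmp_folder = "/mnt/abcd11/{}/{}/{}/tmp_output".format(now.year, now.month, now.day)
    final_file = "/mnt/abcd11/{}/{}/{}/abc.csv".format(now.year, now.month, now.day)

    # write a single part file by coalescing to one partition
    df11.coalesce(1).write.mode("overwrite").option("header", "true").csv(tmp_folder)

    # locate the part file Spark produced and copy it to the desired file name
    part_file = [f.path for f in dbutils.fs.ls(tmp_folder) if f.name.startswith("part-")][0]
    dbutils.fs.cp(part_file, final_file)

    # remove the temporary folder with the part/metadata files
    dbutils.fs.rm(tmp_folder, recurse=True)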