Tags: pyspark, apache-spark-sql, databricks, azure-blob-storage, azure-databricks

How to Write a Spark DataFrame (in Databricks) to Blob Storage (in Azure)?


I am working in Databricks, where I have a DataFrame:

type(df) 
Out: pyspark.sql.dataframe.DataFrame

All I want is to write this complete Spark DataFrame to Azure Blob Storage.

I found this post, so I tried that code:

# Configure blob storage account access key globally
spark.conf.set(
  "fs.azure.account.key.%s.blob.core.windows.net" % storage_name,
  sas_key)

output_container_path = "wasbs://%s@%s.blob.core.windows.net" % (output_container_name, storage_name)
output_blob_folder = "%s/wrangled_data_folder" % output_container_path

# write the dataframe as a single CSV file to blob storage
(df
 .coalesce(1)
 .write
 .mode("overwrite")
 .option("header", "true")
 .format("com.databricks.spark.csv")
 .save(output_blob_folder))

Running that code leads to the error below. Swapping the "csv" format for parquet and other formats fails as well.

org.apache.spark.sql.AnalysisException: CSV data source does not support struct<AccessoryMaterials:string,CommercialOptions:string,DocumentsUsed:array<string>,Enumerations:array<string>,EnvironmentMeasurements:string,Files:array<struct<Value:string,checksum:string,checksumType:string,name:string,size:string>>,GlobalProcesses:string,Printouts:array<string>,Repairs:string,SoftwareCapabilities:string,TestReports:string,endTimestamp:string,name:string,signature:string,signatureMeaning:bigint,startTimestamp:string,status:bigint,workplace:string> data type.;
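
For context: CSV is a flat text format, so Spark refuses to write struct and array columns (like Files or DocumentsUsed above) into it directly. A minimal sketch of one workaround, assuming it is acceptable to store the complex columns as JSON strings, is to serialize them with to_json before writing:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, MapType, StructType

# Serialize every struct/array/map column to a JSON string so the flat
# CSV format can hold it; plain columns are passed through unchanged.
csv_safe_df = df.select([
    F.to_json(F.col(f.name)).alias(f.name)
    if isinstance(f.dataType, (StructType, ArrayType, MapType))
    else F.col(f.name)
    for f in df.schema.fields
])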

Hence my question (which I assume should be easy): how can I write my Spark DataFrame from Databricks to Azure Blob Storage?

My Azure folder structure is like this:

Account = MainStorage
Container 1 is called "Data" # contains all the input data; irrelevant here, since I have already read it in
Container 2 is called "Output" # this is where I want to store my Spark DataFrame

Many thanks in advance!

EDIT: I am using Python, but I don't mind if the solution is in another language, as long as Databricks supports it (R, Scala, etc.). If it works, it is perfect :-)


Solution

  • Assuming you have already mounted the blob storage, use the approach below to write your DataFrame in CSV format.
    Note that the newly created file gets a default, auto-generated name with a csv extension, so you may need to rename it afterwards to get a consistent name; a PySpark sketch of both steps follows the Scala snippet.

    // output_container_path = wasbs://ContainerName@StorageAccountName.blob.core.windows.net/DirectoryName
    val mount_root = "/mnt/ContainerName/DirectoryName"
    df.coalesce(1).write.format("csv").option("header", "true").mode("overwrite").save(s"dbfs:$mount_root/")
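
    Since the question uses Python, here is a PySpark sketch of the same approach, including the rename step mentioned above. The mount check, the mount point, and the account-key handling are assumptions rather than part of the original answer; the account and container names are taken from the question:

    # Mount the "Output" container if it is not mounted yet.
    # The account key below is a placeholder -- in practice, read it
    # from a secret scope instead of hard-coding it.
    storage_name = "MainStorage"
    container_name = "Output"
    mount_point = "/mnt/Output"
    account_key = "<storage-account-access-key>"  # placeholder

    if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        dbutils.fs.mount(
            source="wasbs://%s@%s.blob.core.windows.net" % (container_name, storage_name),
            mount_point=mount_point,
            extra_configs={"fs.azure.account.key.%s.blob.core.windows.net" % storage_name: account_key})

    output_dir = "%s/wrangled_data_folder" % mount_point

    # Write a single CSV part file. Struct/array columns must be
    # serialized first, e.g. with to_json as sketched in the question.
    (df
     .coalesce(1)
     .write
     .mode("overwrite")
     .option("header", "true")
     .csv("dbfs:%s" % output_dir))

    # Spark picks a default part-*.csv file name; rename it to a stable one.
    part_file = [f.path for f in dbutils.fs.ls(output_dir) if f.name.startswith("part-")][0]
    dbutils.fs.mv(part_file, "%s/wrangled_data.csv" % output_dir)

    Note that coalesce(1) funnels all rows through a single task, which is fine for small outputs but can become a bottleneck for large DataFrames; dropping it and accepting multiple part files scales better.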