Azure Databricks creating unnecessary folders

Azure Databricks is creating random folders while writing and merging.

I run the following write in Databricks:

df.write.format('delta').mode('overwrite').save("abfss://[email protected]/some_path/events")

When I check the Azure Storage UI, I see some unexpected folders:

What is the folder xJ, and why is it being created?

Details of the operation:

engineInfo: Databricks-Runtime/13.2.x-scala2.12
isolationLevel: WriteSerializable

Solution

The folder name 'xJ' is most likely a randomly generated prefix that Delta Lake creates internally to organize the table's data files. It is part of the layout Delta Lake uses to manage data efficiently and reliably.
These internal folders are managed automatically by Delta Lake and the table depends on them, so they should not be renamed, modified, or deleted by hand.
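
To make the layout easier to read, here is a minimal sketch that labels the entries you might see at the top level of a Delta table directory. The function name classify_entry and the two-character pattern are assumptions for illustration only: the pattern is chosen to match 'xJ', and the real prefix length depends on the table's configuration. One possible cause worth checking (not confirmed for this table) is Delta Lake's random file prefix feature, controlled by the delta.randomizeFilePrefixes table property.

```python
import re

# Assumption: random prefixes are short alphanumeric names (two characters
# here, matching 'xJ'); the actual length is configurable on the table.
RANDOM_PREFIX = re.compile(r"[A-Za-z0-9]{2}")


def classify_entry(name: str) -> str:
    """Guess what a top-level entry in a Delta table directory is."""
    name = name.rstrip("/")
    if name == "_delta_log":
        return "transaction log"
    if name.endswith(".parquet"):
        return "data file"
    if "=" in name:
        # Hive-style partition folders look like column=value
        return "partition directory"
    if RANDOM_PREFIX.fullmatch(name):
        return "random-prefix directory"
    return "other"


if __name__ == "__main__":
    for entry in ["_delta_log", "xJ", "part-00000-example.snappy.parquet"]:
        print(entry, "->", classify_entry(entry))
```

This is only a naming heuristic. The authoritative list of files that belong to the table is in the _delta_log transaction log, not in the folder names.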

I tried both local and global Spark sessions and performed the write and merge operations against ADLS.

Below is the global Spark session version:

from pyspark.sql import SparkSession

# Reuse the existing session if there is one, otherwise create it
spark = SparkSession.builder \
    .appName("example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# Initial write: overwrite the Delta table with three rows
data = [('Alice', 34), ('Bob', 55), ('Charlie', 45)]
columns = ['name', 'age']
df = spark.createDataFrame(data, columns)
df.write.format('delta').mode('overwrite').save("abfss://[email protected]/1_path/events")

# Append two more rows to the same table
new_data = [('Dave', 28), ('Eva', 38)]
new_df = spark.createDataFrame(new_data, columns)
new_df.write.format('delta').mode('append').save("abfss://[email protected]/1_path/events")
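
The snippet above only overwrites and appends. For the merge step, a minimal sketch using the Delta Lake Python API could look like the following. The helper names build_merge_condition and upsert, and the choice of name as the key column, are assumptions for illustration and are not taken from the original test.

```python
def build_merge_condition(key_columns, target="t", source="s"):
    """Build a join condition such as 't.name = s.name' for the merge."""
    return " AND ".join(f"{target}.{c} = {source}.{c}" for c in key_columns)


def upsert(spark, updates_df, table_path, key_columns):
    """Merge updates_df into the Delta table stored at table_path."""
    # Imported lazily so build_merge_condition works without delta-spark installed.
    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, table_path)
    (
        target.alias("t")
        .merge(updates_df.alias("s"), build_merge_condition(key_columns))
        .whenMatchedUpdateAll()      # update rows whose key already exists
        .whenNotMatchedInsertAll()   # insert rows with a new key
        .execute()
    )
```

With the session and DataFrames from the example, this would be called as upsert(spark, new_df, table_path, ['name']), where table_path is the same abfss path used for the writes.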


With this global Spark session approach, the write and merge operations run consistently against the same table, which should help you avoid unexpected folder creation or naming issues in Azure Databricks with Delta Lake.