python, pyspark, databricks, azure-databricks, delta-lake

How to read empty delta partitions without failing in Azure Databricks?


I'm looking for a workaround. Sometimes our automated framework tries to read Delta partitions that do not exist, and the read fails because there are no Parquet files in that partition.

I don't want it to fail.

What I currently do is:

spark_read.format('delta').option("basePath",location) \
  .load('/mnt/water/green/date=20221209/object=34')

Instead, I want it to return an empty DataFrame, i.e. a DataFrame with no records.

I ended up with the following, but found it a bit cumbersome, and was wondering if there is a better way.

# Load the whole table, then narrow it down to the requested partition
df = spark_read.format('delta').load(location)
# strip the leading "/" so split() doesn't produce an empty token
folder_partition = '/date=20221209/object=34'.strip("/").split("/")
for folder_pruning_token in folder_partition:
  folder_pruning_token_split = folder_pruning_token.split("=")
  column_name = folder_pruning_token_split[0]
  column_value = folder_pruning_token_split[1]
  df = df.filter(df[column_name] == column_value)

Solution

  • You really don't need that trick with Delta Lake tables. It was primarily used for Parquet and other file formats to avoid scanning files on HDFS or cloud storage, which is very expensive.

    You just need to load the data and filter it with where/filter, similar to what you already do:

    df = spark_read.format('delta').load(location) \
      .filter("date = '20221209' and object = 34")
    

    If you need to, you can of course extract those values from the path automatically, with slightly simpler code:

    df = spark_read.format('delta').load(location)
    # strip the leading "/" so split() doesn't produce an empty token
    folder_partition = '/date=20221209/object=34'.strip("/").split("/")
    cols = [f"{s[0]} = '{s[1]}'"
            for s in [f.split('=') for f in folder_partition]]
    df = df.filter(" and ".join(cols))
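
    If this comes up in several jobs, the same idea can be wrapped in a small helper. The sketch below is only an illustration of the approach above, not part of the original answer; the function name partition_path_to_filter and the choice to quote every value as a string are assumptions:

    def partition_path_to_filter(path):
        """Turn a partition path like '/date=20221209/object=34' into a SQL
        filter string such as "date = '20221209' and object = '34'".
        All values are quoted as strings; Spark casts them when comparing
        against numeric partition columns."""
        tokens = path.strip("/").split("/")
        pairs = [t.split("=") for t in tokens]
        return " and ".join(f"{name} = '{value}'" for name, value in pairs)

    # Usage: if the partition has no data, this simply returns an empty
    # DataFrame instead of failing, thanks to Delta's partition pruning.
    df = spark_read.format('delta').load(location) \
        .filter(partition_path_to_filter('/date=20221209/object=34'))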