apache-spark, pyspark, partitioning, azure-databricks, delta-lake

What is the best practice for loading a specific partition of a Delta table in Databricks?


I would like to know the best way to load a specific partition of a Delta table. Does option 2 load the whole table before filtering?

Option 1:

df = spark.read.format("delta").option('basePath','/mnt/raw/mytable/')\
   .load('/mnt/raw/mytable/ingestdate=20210703')

(Is the basePath option needed here?)

Option 2:

from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')

Many thanks in advance!


Solution

  • In the second option, Spark loads only the partitions that match the filter condition: internally it performs partition pruning and reads only the relevant data from the source table.

    In the first option, you are directly instructing Spark to read only the partition given in the path.

    So in both cases you end up reading only the data of the respective partition (see the sketch below for how to verify the pruning).
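
    A quick way to check this yourself is to inspect the physical plan: when partition pruning applies, the predicate on the partition column shows up as a partition filter rather than a plain post-scan filter. Below is a minimal sketch, assuming the table path /mnt/raw/mytable/ and the partition column ingestdate from the question, and that spark is the active SparkSession (as in a Databricks notebook):

    from pyspark.sql.functions import col

    # Read from the table root; Delta resolves the data files from the transaction log.
    df = spark.read.format("delta").load('/mnt/raw/mytable/')

    # Filter on the partition column; Spark prunes non-matching partitions
    # before scanning any data files.
    pruned = df.filter(col('ingestdate') == '20210703')

    # Print the extended plan; the ingestdate predicate should appear as a
    # partition filter, so only files under ingestdate=20210703 are read.
    pruned.explain(True)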