apache-spark, pyspark, partitioning, azure-databricks, delta-lake

What is the best practice for loading a specific partition of a Delta table in Databricks?


I would like to know the best way to load a specific partition of a Delta table. Does option 2 load the whole table before filtering?

Option 1:

df = spark.read.format("delta").option('basePath','/mnt/raw/mytable/')\
   .load('/mnt/raw/mytable/ingestdate=20210703')

(Is the basePath option needed here?)

Option 2:

from pyspark.sql.functions import col

df = spark.read.format("delta").load('/mnt/raw/mytable/')
df = df.filter(col('ingestdate') == '20210703')

Many thanks in advance!


Solution

  • In the second option, Spark loads only the partitions that match the filter condition: internally it performs partition pruning and reads only the relevant data from the source table.

    In the first option, you are directly instructing Spark to read only the partition given in the path.

    So in both cases you end up reading only the data of the respective partition (see the sketch below for how to verify the pruning).
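
    A quick way to check this yourself is to inspect the physical plan: when partition pruning applies, the predicate on the partition column shows up as a partition filter rather than a plain post-scan filter. Below is a minimal sketch, assuming the table path /mnt/raw/mytable/ and the partition column ingestdate from the question, and that spark is the active SparkSession (as in a Databricks notebook):

    from pyspark.sql.functions import col

    # Read from the table root; Delta resolves the data files from the transaction log.
    df = spark.read.format("delta").load('/mnt/raw/mytable/')

    # Filter on the partition column; Spark prunes non-matching partitions
    # before scanning any data files.
    pruned = df.filter(col('ingestdate') == '20210703')

    # Print the extended plan; the ingestdate predicate should appear as a
    # partition filter, so only files under ingestdate=20210703 are read.
    pruned.explain(True)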