I have created a spark dataframe reading csv from hdfs location.
emp_df = spark.read.format("com.databricks.spark.csv") \
.option("mode", "DROPMALFORMED") \
.option("header", "true") \
.option("inferschema", "true") \
.option("delimiter", ",").load(PATH_TO_FILE)
and saving this dataframe as Hive paritioned orc table using partitionBy method
emp_df.repartition(5, 'emp_id').write.format('orc').partitionBy("emp_id").saveAsTable("UDB.temptable")
when I am reading this table as below method and If I look at the logical and physical plan, it seems that it has perfectly filtered the data using partition key column:
emp_df_1 = spark.sql("select * from UDB.temptable where emp_id ='6'")
emp_df_1.explain(True)
***************************************************************************
== Parsed Logical Plan ==
'Project [*]
+- 'Filter ('emp_id = 6)
+- 'UnresolvedRelation `UDB`.`temptable`
== Analyzed Logical Plan ==
emp_name: string, emp_city: string, emp_salary: int, emp_id: int
Project [emp_name#7399, emp_city#7400, emp_salary#7401, emp_id#7402]
+- Filter (emp_id#7402 = cast(6 as int))
+- SubqueryAlias temptable
+- Relation[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] orc
== Optimized Logical Plan ==
Filter (isnotnull(emp_id#7402) && (emp_id#7402 = 6))
+- Relation[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] orc
== Physical Plan ==
*(1) FileScan orc udb.temptable[emp_name#7399,emp_city#7400,emp_salary#7401,emp_id#7402] Batched: true, Format: ORC, Location: PrunedInMemoryFileIndex[hdfs://pathlocation/database/udb....,
PartitionCount: 1, PartitionFilters: [isnotnull(emp_id#7402), (emp_id#7402 = 6)], PushedFilters: [], ReadSchema: struct<emp_name:string,emp_city:string,emp_salary:int>
***************************************************************************
whereas If I read this dataframe via absolute hdfs path location, it seems that it is not able to filter the data using partition key column:
emp_df_2 = spark.read.format("orc").load("hdfs://pathlocation/database/udb.db/temptable/emp_id=6")
emp_df_2.explain(True)
******************************************************************************
== Parsed Logical Plan ==
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Analyzed Logical Plan ==
emp_name: string, emp_city: string, emp_salary: int
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Optimized Logical Plan ==
Relation[emp_name#7411,emp_city#7412,emp_salary#7413] orc
== Physical Plan ==
*(1) FileScan orc [emp_name#7411,emp_city#7412,emp_salary#7413] Batched: true, Format: ORC, Location: InMemoryFileIndex[hdfs://pathlocation/data/database/udb.db/tem...,
PartitionFilters: [], PushedFilters: [], ReadSchema: struct<emp_name:string,emp_city:string,emp_salary:int>
********************************************************************************
Could you please help me to understand the logical and physical plan in both the cases?
In your second example partition location is already covered in HDFS path. You can still put parent directory as a path and make use of partitioning with the following code:
full_dataset_df = spark.read.format("orc") \
.load("hdfs://pathlocation/database/udb.db/temptable")
one_partition_df = full_dataset_df.where(full_dataset_df.emp_id == 6)
It's worthy to mention that no matter which of these 3 methods you will use, the data processing performance will be the same.