I'm using PySpark (Spark 1.4.1) on a cluster. I have two DataFrames, each containing the same key values but different data for the other fields.
I partitioned each DataFrame separately by the key and wrote a Parquet file to HDFS. I then read the Parquet files back into memory as new DataFrames. If I join the two DataFrames, will the processing for the join happen on the same workers?
For example:

dfA contains {userid, firstname, lastname}, partitioned by userid
dfB contains {userid, activity, job, hobby}, partitioned by userid

dfC = dfA.join(dfB, dfA.userid == dfB.userid)

Is dfC already partitioned by userid?
Is dfC already partitioned by userid?
The answer depends on what you mean by partitioned. Records with the same userid should be located on the same partition, but DataFrames don't support partitioning in the sense of having a Partitioner. Only pair RDDs (RDD[(T, U)]) can have a partitioner in Spark. This means that for most applications the answer is no: neither the DataFrame nor the underlying RDD is partitioned.
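To illustrate what having a partitioner means, here is a plain-Python sketch (not Spark code) of the scheme Spark's HashPartitioner uses for pair RDDs: a record's partition is determined purely by its key, so all records sharing a key land in the same partition. The record data and partition count below are made up for illustration:

```python
def assign_partition(key, num_partitions):
    # Same idea as Spark's HashPartitioner: hash the key, take a
    # non-negative modulo, and every key maps to one fixed partition.
    return hash(key) % num_partitions

records = [("u1", "alice"), ("u2", "bob"), ("u1", "smith")]
num_partitions = 4

partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[assign_partition(key, num_partitions)].append((key, value))

# All records sharing a key end up in the same partition.
p_u1 = assign_partition("u1", num_partitions)
assert all((k, v) in partitions[p_u1] for k, v in records if k == "u1")
```

A DataFrame carries no such key-to-partition mapping, which is why downstream operations can't rely on one.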
You'll find more details about DataFrames and partitioning in How to define partitioning of DataFrame? Another question worth following is Co-partitioned joins in Spark SQL.
If I join the two DataFrames, will the processing for the join happen on the same workers?
Once again, it depends on what you mean. Records with the same userid have to be transferred to the same node before the joined rows can be produced. If you're asking whether this is guaranteed to happen without any network traffic, the answer is no.
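A toy sketch of the shuffle that precedes a join may help; this is an illustration of the idea, not Spark internals, and the data and partition count are invented. Rows from both inputs are first routed by key so that matching userids meet in the same bucket, and only then can the join run locally per bucket:

```python
def route(rows, num_partitions):
    # The "shuffle" step: each row moves to the bucket owning its key.
    # In a cluster this routing is where network transfer happens.
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[hash(row[0]) % num_partitions].append(row)
    return buckets

df_a = [("u1", "alice"), ("u2", "bob")]     # (userid, firstname)
df_b = [("u1", "hiking"), ("u2", "spark")]  # (userid, hobby)

buckets_a = route(df_a, 3)
buckets_b = route(df_b, 3)

# After routing, the join runs bucket-by-bucket, entirely locally.
joined = [
    (ka, va, vb)
    for ba, bb in zip(buckets_a, buckets_b)
    for ka, va in ba
    for kb, vb in bb
    if ka == kb
]
```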
To be clear, it would be exactly the same even if the DataFrame had a partitioner. Data co-partitioning is not equivalent to data co-location. It only means that the join operation can be performed with a one-to-one mapping between partitions, without shuffling. You can find more in Daniel Darbos' answer to Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?.