I'm using PySpark (Spark 1.4.1) on a cluster. I have two DataFrames, each containing the same key values but different data for the other fields.
I partitioned each DataFrame separately by the key and wrote a Parquet file to HDFS. I then read the Parquet files back into memory as new DataFrames. If I join the two DataFrames, will the processing for the join happen on the same workers?
For example:

dfA contains {userid, firstname, lastname}, partitioned by userid
dfB contains {userid, activity, job, hobby}, partitioned by userid

dfC = dfA.join(dfB, dfA.userid == dfB.userid)

Is dfC already partitioned by userid?
Is dfC already partitioned by userid?
The answer depends on what you mean by partitioned. Records with the same userid should be located on the same partition, but DataFrames don't support partitioning in the sense of having a Partitioner. Only pair RDDs (RDD[(T, U)]) can have a partitioner in Spark. This means that for most applications the answer is no: neither the DataFrame nor the underlying RDD is partitioned.
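To illustrate what having a partitioner means, here is a plain-Python sketch (not Spark code) of the scheme Spark's HashPartitioner uses for pair RDDs: a record's partition is determined purely by its key, so all records sharing a key land in the same partition. The record data and partition count below are made up for illustration:

```python
def assign_partition(key, num_partitions):
    # Same idea as Spark's HashPartitioner: hash the key, take a
    # non-negative modulo, and every key maps to one fixed partition.
    return hash(key) % num_partitions

records = [("u1", "alice"), ("u2", "bob"), ("u1", "smith")]
num_partitions = 4

partitions = [[] for _ in range(num_partitions)]
for key, value in records:
    partitions[assign_partition(key, num_partitions)].append((key, value))

# All records sharing a key end up in the same partition.
p_u1 = assign_partition("u1", num_partitions)
assert all((k, v) in partitions[p_u1] for k, v in records if k == "u1")
```

A DataFrame carries no such key-to-partition mapping, which is why downstream operations can't rely on one.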
You'll find more details about DataFrames and partitioning in How to define partitioning of DataFrame? Another question worth following is Co-partitioned joins in Spark SQL.
If I join the two DataFrames, will the processing for the join happen on the same workers?
Once again, it depends on what you mean. Records with the same userid have to be transferred to the same node before the joined rows can be produced. If you're asking whether this is guaranteed to happen without any network traffic, the answer is no.
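A toy sketch of the shuffle that precedes a join may help; this is an illustration of the idea, not Spark internals, and the data and partition count are invented. Rows from both inputs are first routed by key so that matching userids meet in the same bucket, and only then can the join run locally per bucket:

```python
def route(rows, num_partitions):
    # The "shuffle" step: each row moves to the bucket owning its key.
    # In a cluster this routing is where network transfer happens.
    buckets = [[] for _ in range(num_partitions)]
    for row in rows:
        buckets[hash(row[0]) % num_partitions].append(row)
    return buckets

df_a = [("u1", "alice"), ("u2", "bob")]     # (userid, firstname)
df_b = [("u1", "hiking"), ("u2", "spark")]  # (userid, hobby)

buckets_a = route(df_a, 3)
buckets_b = route(df_b, 3)

# After routing, the join runs bucket-by-bucket, entirely locally.
joined = [
    (ka, va, vb)
    for ba, bb in zip(buckets_a, buckets_b)
    for ka, va in ba
    for kb, vb in bb
    if ka == kb
]
```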
To be clear, it would be exactly the same even if the DataFrame had a partitioner. Data co-partitioning is not equivalent to data co-location. It only means that the join operation can be performed with a one-to-one mapping between partitions, without shuffling. You can find more in Daniel Darbos' answer to Does a join of co-partitioned RDDs cause a shuffle in Apache Spark?.