Does DataFrameWriter partitionBy shuffle the data?

I have data partitioned one way, and I just want to partition it another way. So it is basically going to be something like this: "...").write().partitionBy("...").parquet("...")

I wonder whether this will trigger a shuffle or whether all the data will be repartitioned locally. In this context a partition is just a directory in HDFS, and data from the same partition doesn't have to be on the same node to be written to the same directory in HDFS.


  • Neither partitionBy nor bucketBy shuffles the data. There are cases, though, where repartitioning the data first can be a good idea.

    Each task writes a separate file for every partition value it holds, so without that step the number of output files is bounded by (number of in-memory partitions) * (cardinality of the partitioning column).
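
To make the file-count bound concrete, here is a small plain-Java simulation (not Spark code). It assumes the writer emits one file per distinct partition value held by each task. The class and method names are hypothetical. The modelled shuffle stands in for calling repartition on the partitioning column before write().partitionBy(...).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

class PartitionByFileCount {

    // Assumption: each in-memory partition (task) writes one file
    // per distinct partition-column value it contains.
    static int countOutputFiles(List<List<String>> partitions) {
        int files = 0;
        for (List<String> partition : partitions) {
            files += new HashSet<>(partition).size();
        }
        return files;
    }

    // Simulated shuffle: every row with the same key lands in the same partition.
    static List<List<String>> repartitionByKey(List<List<String>> partitions, int numPartitions) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            result.add(new ArrayList<>());
        }
        for (List<String> partition : partitions) {
            for (String key : partition) {
                result.get(Math.floorMod(key.hashCode(), numPartitions)).add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Four tasks, each holding rows for all three partition values.
        List<List<String>> input = new ArrayList<>();
        for (int i = 0; i < 4; i++) {
            input.add(Arrays.asList("a", "b", "c"));
        }
        System.out.println(countOutputFiles(input));                      // 12
        System.out.println(countOutputFiles(repartitionByKey(input, 4))); // 3
    }
}
```

With four tasks that each hold all three keys, the first count is 12 (4 * 3). After the rows are grouped by key, it drops to 3, one file per partition value. The trade-off is that the grouping step is itself a shuffle.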