Tags: apache-spark, hadoop, apache-spark-sql, hdfs, partitioning

Does dataFrameWriter partitionBy shuffle the data?


I have data partitioned one way, and I just want to partition it another way. So it will basically be something like this:

sqlContext.read().parquet("...").write().partitionBy("...").parquet("...")

I wonder whether this will trigger a shuffle, or whether all the data will be re-partitioned locally. In this context a partition is just a directory in HDFS, so data from the same partition doesn't have to be on the same node to be written into the same directory.


Solution

  • Neither partitionBy nor bucketBy shuffles the data. There are cases, though, when repartitioning the data first is a good idea:

    df.repartition(...).write.partitionBy(...)
    

    Otherwise the number of output files is bounded by the number of Spark partitions multiplied by the cardinality of the partitioning column: in the worst case, every task holds rows for every distinct value and writes a separate file into each directory.
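
    To see why that bound matters, here is a back-of-the-envelope sketch. The numbers (200 Spark partitions, which happens to be Spark's default shuffle parallelism, and a hypothetical partitioning column with 50 distinct values) are made up for illustration, not taken from the question:

    ```python
    # Hypothetical figures, for illustration only.
    num_spark_partitions = 200          # in-memory partitions of the DataFrame
    partition_column_cardinality = 50   # distinct values of the partitionBy column

    # Worst case without a prior repartition: every Spark partition contains
    # rows for every distinct value, so each task writes one file per value.
    max_files_without_repartition = num_spark_partitions * partition_column_cardinality
    print(max_files_without_repartition)  # 10000

    # After df.repartition(col), rows sharing a value are collocated in one
    # Spark partition, so each output directory receives roughly one file.
    files_after_repartition = partition_column_cardinality
    print(files_after_repartition)  # 50
    ```

    That is why `df.repartition(col).write.partitionBy(col)` pays the cost of one shuffle up front but can cut the output from thousands of small files to roughly one file per directory.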