I have data partitioned one way, and I just want to partition it another way. So it's basically going to be something like this:
sqlContext.read().parquet("...").write().partitionBy("...").parquet("...")
I wonder whether this will trigger a shuffle, or whether all the data will be repartitioned locally. In this context a partition is just a directory in HDFS, so data from the same partition doesn't have to be on the same node to be written into the same directory.
Neither `partitionBy` nor `bucketBy` shuffles the data. There are cases, though, when repartitioning the data first can be a good idea:
df.repartition(...).write.partitionBy(...)
Otherwise the number of output files is bounded by the number of partitions times the cardinality of the partitioning column: each partition writes one file per distinct value it contains.
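To see why that bound holds, here is a small pure-Python sketch (no Spark; all names are illustrative) that mimics what `partitionBy` does: each in-memory partition emits one output file per distinct partitioning-column value it holds, so grouping rows by that value first collapses the file count.

```python
from collections import defaultdict

def files_written(partitions, key):
    """Simulate partitionBy: each partition writes one output file
    per distinct partitioning-column value it contains."""
    files = []
    for i, part in enumerate(partitions):
        values = {key(row) for row in part}
        files.extend((i, v) for v in values)  # one file per (partition, value)
    return files

# 3 partitions, a partitioning column with 2 distinct values.
rows = [{"date": d, "x": i} for i, d in enumerate(["a", "b"] * 6)]
parts = [rows[0:4], rows[4:8], rows[8:12]]

out = files_written(parts, lambda r: r["date"])
print(len(out))  # 6: every partition holds both values -> 3 * 2 files

# Simulate df.repartition("date") first: group rows by the column value,
# so each partition holds a single value -> one file per distinct value.
grouped = defaultdict(list)
for r in rows:
    grouped[r["date"]].append(r)

out2 = files_written(list(grouped.values()), lambda r: r["date"])
print(len(out2))  # 2: one file per distinct "date" value
```

In real Spark this is why `df.repartition($"date").write.partitionBy("date")` tends to produce far fewer (and larger) output files than writing without the repartition, at the cost of the shuffle that `repartition` introduces.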