Suppose I have a 10 GB DataFrame where one of the columns, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). If I call repartition($"c1"), will all the records be shuffled to the same partition? If so, wouldn't that exceed the maximum size per partition? How does repartition work in this case?
The configuration spark.sql.files.maxPartitionBytes is effective only when reading files from file-based sources; it does not cap the size of partitions produced by a shuffle. When you call repartition($"c1"), Spark reshuffles your existing DataFrame by hash-partitioning on c1 into spark.sql.shuffle.partitions (200 by default) output partitions. Because every record has the same value of c1, all rows hash to the same partition, so in your case you effectively end up with one partition holding the entire 10 GB while the remaining partitions are empty.
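A minimal sketch of this behavior, assuming a local SparkSession and using a small generated DataFrame as a stand-in for your 10 GB one (the names `df` and `repartitioned` are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("repartition-skew-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Toy stand-in for the 10 GB DataFrame: every row shares the same c1 value.
val df = spark.range(0, 1000000).withColumn("c1", lit("constant"))

val repartitioned = df.repartition($"c1")

// repartition by column hash-shuffles into spark.sql.shuffle.partitions
// partitions (200 by default); with a single distinct key, every row
// lands in the same partition and the rest stay empty.
println(s"Total partitions: ${repartitioned.rdd.getNumPartitions}")

val rowsPerPartition = repartitioned.rdd
  .mapPartitions(rows => Iterator(rows.size))
  .collect()
println(s"Non-empty partitions: ${rowsPerPartition.count(_ > 0)}") // expect 1
```

At your real 10 GB scale the single non-empty partition simply grows to hold all the data; no 128 MB limit is enforced on shuffle output, so the practical risk is executor memory pressure and disk spill on the one task that processes it, not a Spark error about partition size.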