Tags: scala, apache-spark, apache-spark-sql, apache-spark-sql-repartition

Apache Spark: What happens when repartition($"key") is called and the total size of all records for a key is greater than the size of a single partition?


Suppose I have a 10 GB DataFrame in which one column, "c1", has the same value for every record. Each partition is at most 128 MB (the default value). If I call repartition($"c1"), will all the records be shuffled into the same partition? If so, wouldn't that exceed the maximum size per partition? How does repartition work in this case?


Solution

  • The configuration spark.sql.files.maxPartitionBytes (default 128 MB) takes effect only when reading from file-based sources; it does not constrain partitions produced by a shuffle. When you call repartition($"c1"), Spark hash-partitions the rows by the value of c1. The shuffle produces spark.sql.shuffle.partitions output partitions (200 by default), but since every row has the same value of c1, every row hashes to the same partition: one partition ends up holding all 10 GB while the rest are empty. There is no hard per-partition size limit at shuffle time, so this works, but such a skewed partition can cause memory pressure and slow tasks.
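This behavior can be observed directly. The following is a minimal sketch (assuming a local Spark session; the column name c1 and row count are illustrative, and the snippet is not a definitive implementation) that builds a DataFrame with a constant-valued column, repartitions on it, and prints the row count per resulting partition:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder()
  .appName("repartition-skew-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Every row carries the same value in c1.
val df = spark.range(0, 1000000).withColumn("c1", lit("constant"))

// Hash-partition by c1: all rows hash to the same shuffle partition.
val repartitioned = df.repartition($"c1")

// Count rows per partition. With the default
// spark.sql.shuffle.partitions = 200, 199 counts are 0 and one
// partition holds all 1,000,000 rows.
repartitioned
  .mapPartitions(it => Iterator(it.size))
  .collect()
  .foreach(println)
```

Note that repartition($"c1") with a constant key is effectively the same as repartition(1) in terms of data placement, except that the empty shuffle partitions still exist as (trivial) tasks.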