Tags: apache-spark, distributed-computing

How are partitions assigned to tasks in Spark


Let's say I'm reading 100 files from an S3 folder, each of size 10 MB. When I execute df = spark.read.parquet(s3 path), how do the files (or rather partitions) get distributed across tasks? E.g. in this case df is going to have 100 partitions, and if Spark has 10 tasks running to read the contents of this folder into the data frame, how are the partitions assigned to the 10 tasks? Is it in a round-robin fashion, does each task get an equal proportion of all partitions in a range-based distribution, or something else? Any pointer to relevant resources would also be very helpful. Thank you.
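For reference, a quick way to check how many input partitions the scan actually produced; a minimal sketch, where the S3 path is a placeholder:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("read-parquet").getOrCreate()

    # Placeholder path -- 100 parquet files of ~10 MB each
    df = spark.read.parquet("s3://my-bucket/my-folder/")

    # Number of partitions Spark created for the scan; it is not necessarily
    # one per file and is influenced by spark.sql.files.maxPartitionBytes
    print(df.rdd.getNumPartitions())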


Solution

  • The number of tasks is directly proportional to the number of partitions.

    Spark tries to partition the rows directly from the original partitions without bringing anything to the driver.

    The partitioning logic is to start with a randomly picked target partition and then assign target partitions to the rows in a round-robin fashion. Note that a "start" partition is picked for each source partition, so there can be collisions.

    The final distribution depends on many factors: the number of source/target partitions and the number of rows in your DataFrame. The sketch below shows one way to inspect the resulting layout.
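
    A minimal PySpark sketch to observe this, assuming the same placeholder S3 path as in the question; spark_partition_id() tags each row with the id of the partition it belongs to:

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import spark_partition_id

        spark = SparkSession.builder.appName("partition-layout").getOrCreate()

        # Placeholder path -- substitute the real S3 folder
        df = spark.read.parquet("s3://my-bucket/my-folder/")

        # Rows per partition as produced by the scan
        df.groupBy(spark_partition_id().alias("partition_id")) \
          .count() \
          .orderBy("partition_id") \
          .show(100)

        # After an explicit repartition, the counts reflect the round-robin
        # assignment described above
        df.repartition(10) \
          .groupBy(spark_partition_id().alias("partition_id")) \
          .count() \
          .orderBy("partition_id") \
          .show()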