Tags: apache-spark, pyspark

Internals of worker/executor usage during coalesce/repartition


Let's say we have a Spark cluster with the following configuration.

  • 1 Driver and 1000 workers
  • Each worker has 5 executors
  • Each executor has 4 cores

How do coalesce/repartition work when we perform coalesce(50) or repartition(50)? (The calls in question are sketched after the list below.)

  1. Does Spark create 50 partitions, each on a separate worker, i.e. does it use 50 workers? Or
  2. Can many partitions live on a single worker? For example, in the case above, since there are 5 executors per worker, would each worker hold 5 partitions (1 partition per executor), so that Spark uses only 10 workers?
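
To make the calls concrete, here is a minimal PySpark sketch; the dataset and the initial partition count are made up for illustration (the original question assumes a 1000-worker cluster, not a local session):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("coalesce-vs-repartition").getOrCreate()

# Illustrative dataset that starts out with 1000 partitions
df = spark.range(0, 10_000_000, numPartitions=1000)
print(df.rdd.getNumPartitions())             # 1000

coalesced = df.coalesce(50)                  # merges existing partitions, avoids a full shuffle
repartitioned = df.repartition(50)           # performs a full shuffle to redistribute rows

print(coalesced.rdd.getNumPartitions())      # 50
print(repartitioned.rdd.getNumPartitions())  # 50
```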

Solution

    1. It creates 50 partitions in total across the whole cluster.
    2. The question is not entirely clear, but fundamentally partitions are not tied to workers. They can be shuffled around, and their placement depends on the nature of the operations you perform on your data. They can also be distributed unevenly across workers: for example, one worker may end up with 20 partitions, 30 other workers may get 1 partition each, and the remaining 969 may sit idle while the first one grinds through its 20.
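
If you want to see where the 50 partitions actually land, one rough approach is to ask each task which host processed it. This is a sketch under the assumption of a running cluster session; the helper name describe_partition and the sample data are illustrative, and the hostname reflects the worker that happened to execute the task, not a fixed home for the partition:

```python
import socket

from pyspark import TaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-placement-demo").getOrCreate()

# Illustrative data: start with many partitions, then shrink to 50
df = spark.range(0, 10_000_000, numPartitions=1000).repartition(50)

def describe_partition(rows):
    # Report which host ran the task for this partition and how many rows it held
    ctx = TaskContext.get()
    count = sum(1 for _ in rows)
    yield (ctx.partitionId(), socket.gethostname(), count)

placement = df.rdd.mapPartitions(describe_partition).collect()
for partition_id, host, count in sorted(placement):
    print(f"partition {partition_id:>3} -> {host}: {count} rows")
```

Grouping the collected hostnames would show how uneven the placement can be, which is the point made above: nothing guarantees one partition per executor or per worker.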