Tags: apache-spark, k-means, apache-spark-ml

Does the number of tasks, stages and jobs vary when executing a Spark application under the same configuration?


I'm currently executing the K-Means algorithm in a cluster.

Between two consecutive executions under the same configuration (same number of executors, RAM, iterations, dataset), the number of tasks, jobs and stages can vary quite a lot. Over 10 executions the number of tasks had a standard deviation of about 500.

Is this normal? Shouldn't the DAG be the same under the same configurations?

I'm running the Spark implementation of K-Means using Scala.
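
For reference, this is roughly how the job is set up (a minimal spark-shell sketch; the dataset path and parameter values below are placeholders, not my actual settings, and `spark` is the session provided by the shell):

```scala
import org.apache.spark.ml.clustering.KMeans

// Placeholder input: a dataset with a "features" vector column.
val data = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

val kmeans = new KMeans()
  .setK(10)        // number of clusters (placeholder)
  .setMaxIter(100) // upper bound on iterations (placeholder)

val model = kmeans.fit(data)
println(model.clusterCenters.mkString("\n"))
```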


Solution

  • That's perfectly normal behavior.

    The number of iterations required for K-Means to converge depends on the initial choice of centroids, and the process is either fully (random init mode) or partially (K-Means|| init mode) random.

    Since each iteration triggers a job (and creates a separate DAG), the number of stages, and consequently the number of tasks, is proportional to the number of iterations performed before the convergence criteria are satisfied.
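
    If you want consecutive runs to produce the same number of jobs, stages and tasks, you can pin down the randomness. A hedged sketch, assuming the `spark.ml` `KMeans` estimator and a dataset with a `features` column (parameter values are illustrative):

    ```scala
    import org.apache.spark.ml.clustering.KMeans

    val kmeans = new KMeans()
      .setK(10)
      .setInitMode("k-means||") // or "random"; both use a random generator
      .setSeed(42L)             // same seed => same initial centroids => same iteration count
      .setMaxIter(100)
      .setTol(1e-4)             // convergence tolerance

    val model = kmeans.fit(dataset) // `dataset` assumed to contain a "features" column
    println(model.clusterCenters.mkString("\n"))
    ```

    With a fixed seed (and unchanged data and configuration), the initialization is reproducible, so the algorithm converges after the same number of iterations and the DAGs line up across runs.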