I'm currently executing the K-Means algorithm in a cluster.
Between two consecutive executions under the same configuration (same number of executors, RAM, iterations, dataset), the number of tasks, jobs and stages can vary quite a lot. Over 10 executions the number of tasks had a standard deviation of about 500.
Is this normal? Shouldn't the DAG be the same under the same configurations?
I'm running the Spark implementation of K-Means using Scala.
That's perfectly normal behavior.
The number of iterations required for K-Means to converge depends on the initial choice of centroids, and the initialization is either fully random (random init mode) or partially random (K-Means|| init mode).
Since each iteration triggers a job (and creates a separate DAG), the number of stages, and consequently tasks, is proportional to the number of iterations performed before the convergence criteria are satisfied.
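If you want the execution plan to be reproducible across runs, fixing the random seed pins the initial centroids and therefore the iteration count (given the same data and partitioning). Here is a minimal sketch using the DataFrame-based spark.ml API; the input path, the k value, and the "features" column layout are assumptions, not taken from your setup:

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.sql.SparkSession

object DeterministicKMeans {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kmeans-seed-demo").getOrCreate()

    // Assumed input: a libsvm file producing a DataFrame with a "features" vector column.
    val data = spark.read.format("libsvm").load("data/sample_kmeans_data.txt")

    val kmeans = new KMeans()
      .setK(10)                  // assumed number of clusters
      .setInitMode("k-means||")  // default init; "random" is fully random
      .setMaxIter(50)
      .setSeed(42L)              // fixed seed => same initial centroids every run

    val model = kmeans.fit(data)
    println(s"Cluster centers: ${model.clusterCenters.mkString(", ")}")
    spark.stop()
  }
}
```

With the seed fixed, the number of iterations (and hence jobs, stages and tasks) should stay stable from run to run; without it, the variation you observed is expected.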