performance apache-spark bigdata cluster-computing distributed-computing

Tasks taking longer over time in Apache Spark

I have a large dataset (around 5 TB) that I am trying to process with Apache Spark. I have noticed that when the job starts, it retrieves data really fast, and the first stage of the job (a map transformation) completes quickly.

However, after around 500 GB of data has been processed, that map transformation slows down, and some of the tasks take several minutes or even hours to complete.

I am using 10 machines with 122 GB of memory and 16 CPUs each, and I am allocating all of those resources to the worker nodes. I thought about increasing the number of machines, but is there anything else I could be missing?

I have tried with a small portion of my data set (30 GB) and it seemed to be working fine.

Solution

It seems that the stage completes locally on some nodes faster than on others. Based on that observation, here is what I would try:

1. Cache the RDD that you process, and do not forget to unpersist it when you no longer need it. See "Understanding caching, persisting in Spark".
2. Check whether the partitions are balanced, which does not seem to be the case (that would explain why some local stages complete much earlier than others). Having balanced partitions is the holy grail of distributed computing, isn't it? :) See "How to balance my data across the partitions?".
3. Reduce the communication costs, i.e. use fewer workers than you do now, and see what happens. Of course, that heavily depends on your application. Sometimes communication costs become so large that they dominate, so using fewer machines can actually speed up the job. However, I would do that only if steps 1 and 2 do not suffice.
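
As a rough illustration of steps 1 and 2, here is a minimal PySpark sketch. It assumes you are working with a plain RDD. The helper names (`partition_sizes`, `skew_report`, `rebalance_and_cache`) and the choice of `MEMORY_AND_DISK` are my own assumptions for the example, not something taken from your job, so adapt them to your setup.

```python
from statistics import mean


def partition_sizes(rdd):
    """Count records per partition on the executors; only the counts reach the driver."""
    return rdd.mapPartitions(lambda rows: [sum(1 for _ in rows)]).collect()


def skew_report(sizes):
    """Summarize how unevenly records are spread across partitions."""
    if not sizes:
        return {"partitions": 0, "min": 0, "max": 0, "mean": 0, "max_over_mean": 1.0}
    avg = mean(sizes)
    return {
        "partitions": len(sizes),
        "min": min(sizes),
        "max": max(sizes),
        "mean": avg,
        # 1.0 means perfectly balanced; larger values point to straggler partitions.
        "max_over_mean": max(sizes) / avg if avg else 1.0,
    }


def rebalance_and_cache(rdd, num_partitions):
    """Shuffle into roughly equal partitions and keep the result cached (hypothetical helper)."""
    # Imported here so the two helpers above can be used without Spark installed.
    from pyspark import StorageLevel

    balanced = rdd.repartition(num_partitions)
    # MEMORY_AND_DISK is an assumption: it spills to disk when the data does not fit in memory.
    return balanced.persist(StorageLevel.MEMORY_AND_DISK)
```

You would call `skew_report(partition_sizes(rdd))` first. If `max_over_mean` is well above 1, a few partitions hold most of the data, and `rebalance_and_cache(rdd, n)` is worth trying. Note that `repartition` triggers a full shuffle, so it has a cost of its own. Call `unpersist()` on the returned RDD once you are done with it.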