I have a large dataset (around 5 TB) that I am trying to process with Apache Spark. I have noticed that when the job starts, it reads data very quickly and the first stage of the job (a map transformation) makes fast progress.
However, after processing around 500 GB of data, that map transformation slows down, and some of the tasks take several minutes or even hours to complete.
I am using 10 machines, each with 122 GB of memory and 16 CPUs, and I am allocating all of each machine's resources to its worker node. I thought about adding more machines, but is there anything else I could be missing?
I have tried with a small portion of my dataset (30 GB) and it seemed to work fine.
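For a sense of scale, here is a rough back-of-the-envelope sketch of how many input partitions and task "waves" each run implies on this cluster. The 128 MB partition size is an assumption (a common default for file-based input splits), and `estimate_partitions` is a hypothetical helper written for illustration, not a Spark API.

```python
import math

TB = 1024 ** 4
GB = 1024 ** 3
MB = 1024 ** 2


def estimate_partitions(total_bytes, partition_bytes, nodes, cores_per_node):
    """Return (partitions, total_cores, waves) for a given input size.

    A "wave" is one round in which every core runs a single task,
    so waves = ceil(partitions / total_cores).
    """
    partitions = math.ceil(total_bytes / partition_bytes)
    total_cores = nodes * cores_per_node
    waves = math.ceil(partitions / total_cores)
    return partitions, total_cores, waves


# Assumed partition size: 128 MB (not stated in the post).
full_run = estimate_partitions(5 * TB, 128 * MB, nodes=10, cores_per_node=16)
small_run = estimate_partitions(30 * GB, 128 * MB, nodes=10, cores_per_node=16)

print("5 TB run  (partitions, cores, waves):", full_run)
print("30 GB run (partitions, cores, waves):", small_run)
```

Under that assumption the full run needs far more waves of tasks than the 30 GB test, so a problem that only shows up after many waves (memory pressure, skewed partitions, a slow node) would likely not appear in the small test.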
It seems that the stage completes faster on some nodes than on others. Based on that observation, here is what I would try: