I am new to Spark and I am currently try to understand the architecture of spark. As far as I know, the spark cluster manager assigns tasks to worker nodes and sends them partitions of the data. Once there, each worker node performs the transformations (like mapping etc.) on its own specific partition of the data.
What I don't understand is: where do all the results of these transformations from the various workers go to? are they being sent back to the cluster manager / driver and once there reduced (e.g. sum of values of each unique key)? If yes, is there a specific way this happens?
Would be nice if someone is able to enlighten me, neither the spark docs nor other Resources concerning the architecture haven't been able to do so.
Good question, I think you are asking how does a shuffle work...
Here is a good explanation. When does shuffling occur in Apache Spark?