What are the key differences between Join and Reduce in terms of batch processing?
Join waits until every parallel task has completed before the merged result moves on, whereas reduce does not wait: it can begin merging results as soon as they become available.
However, in contrast to the join pattern described in the diagram above, the goal of reduce is not to wait until all data has been processed, but rather to optimistically merge together all of the parallel data items into a single comprehensive representation of the full set.
This is a fortunate contrast to the join pattern because unlike join, it means that reduce can be started in parallel while there is still processing going on as part of the map/shard phase. Of course, in order to produce a complete output, all of the data must be processed eventually, but the ability to begin early means that the batch computation executes more quickly overall.
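The difference can be sketched in a few lines of Python. This is only an illustration under assumptions: the word-count workload, the function names (`count_words`, `join_then_merge`, `reduce_as_completed`) and the `SHARDS` sample data are all made up for the example, and a thread pool stands in for whatever workers a real batch system would use. The join version places a barrier before any merging starts; the reduce version folds each result into a running total as soon as that shard finishes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed, wait


def count_words(shard):
    """Map/shard step (hypothetical workload): count the words in one shard."""
    return Counter(shard.split())


def join_then_merge(shards):
    """Join: block until every shard has finished, and only then merge."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(count_words, s) for s in shards]
        wait(futures)  # barrier: no merging happens until all shards are done
        total = Counter()
        for future in futures:
            total.update(future.result())
    return total


def reduce_as_completed(shards):
    """Reduce: merge each shard's result as soon as it arrives."""
    total = Counter()
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(count_words, s) for s in shards]
        # as_completed yields futures in finish order, so merging overlaps
        # with the shards that are still being processed.
        for future in as_completed(futures):
            total.update(future.result())
    return total


# Sample data, invented for this sketch.
SHARDS = ["a b a", "b c", "a c c"]

if __name__ == "__main__":
    print(join_then_merge(SHARDS))
    print(reduce_as_completed(SHARDS))
```

Both functions return the same totals; only the point at which merging starts differs. Merging in completion order is safe here because adding counts is associative and commutative, so the order in which shard results arrive does not change the final answer.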