
What are collective ops in Tensorflow?


CollectiveAllReduce documentation mentions 'collective ops':

It is similar to the MirroredStrategy but it uses collective ops for reduction.

The question is simple: what are these?


Solution

  • Even though this is a bit of an old question, I thought I might as well answer.

    When it comes to mirroring strategies, Tensorflow (2.0) has two: the MirroredStrategy and the MultiWorkerMirroredStrategy. The MirroredStrategy mirrors the variables on each replica, where a single replica is created per GPU on the machine. On the other hand, the MultiWorkerMirroredStrategy copies the variables to all the workers in a cluster. This is why the multi-worker one needs the TF_CONFIG environment variable set up.
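    To make the difference concrete, here is a minimal sketch. The hostnames and ports in the TF_CONFIG below are placeholders for illustration; in a real cluster each worker process would set its own task index before creating the multi-worker strategy:

    ```python
    import json
    import os

    import tensorflow as tf

    # MirroredStrategy: one replica per GPU on this machine
    # (falls back to a single CPU replica when no GPUs are found).
    mirrored = tf.distribute.MirroredStrategy()
    print("local replicas:", mirrored.num_replicas_in_sync)

    # MultiWorkerMirroredStrategy additionally needs TF_CONFIG on every
    # worker, so each process knows the cluster layout and its own role.
    # "host1"/"host2" are placeholders; each worker sets its own index,
    # then creates tf.distribute.MultiWorkerMirroredStrategy().
    tf_config = {
        "cluster": {"worker": ["host1:12345", "host2:12345"]},
        "task": {"type": "worker", "index": 0},
    }
    os.environ["TF_CONFIG"] = json.dumps(tf_config)
    ```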

    As per the documentation, collective ops help keep the variables in sync between the devices. These ops perform gather, broadcast, reduce and other operations collectively across the different workers.
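    A small sketch of a collective reduction in action: inside `strategy.run`, each replica calls `all_reduce`, which is backed by these collective ops, so every replica contributes its value and receives the combined result. On a machine with no GPUs there is a single replica, so the sum is simply that replica's own value:

    ```python
    import tensorflow as tf

    strategy = tf.distribute.MirroredStrategy()

    @tf.function
    def step():
        def replica_fn():
            # Each replica contributes 1.0; all_reduce (a collective op)
            # sums the contributions across all replicas and hands every
            # replica the same result.
            ctx = tf.distribute.get_replica_context()
            return ctx.all_reduce(tf.distribute.ReduceOp.SUM, tf.constant(1.0))
        return strategy.run(replica_fn)

    result = step()
    # With N replicas, every replica sees the value N.
    print(strategy.experimental_local_results(result))
    ```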