I am looking at the documentation here: https://github.com/Microsoft/CNTK/wiki/Multiple-GPUs-and-machines
According to the text: "Data-Parallel SGD may be used with or without 1bit-SGD."
However, further down in the document the only data-parallel section is the one that uses 1-bit SGD, "Data-Parallel Training with 1-bit SGD", with the following code:
distributed_learner = distributed.data_parallel_distributed_learner(
    learner = learner,
    num_quantization_bits = 1,
    distributed_after = distributed_after)  # warm start: don't use 1-bit SGD for first epoch
If I choose not to use 1-bit SGD (skip the relevant parameters in the above call), I would think that I should still get the parallelization benefits of data_parallel_distributed_learner. Could you confirm that this is the case?
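To be concrete, I mean calling it like this, simply leaving out the 1-bit-specific arguments and relying on their defaults:

distributed_learner = distributed.data_parallel_distributed_learner(learner = learner)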
Thank you
You can set num_quantization_bits to 32, and you will get plain synchronous data-parallel learning without quantization. Be warned, though, that depending on your network, setting num_quantization_bits to 32 may slow down your training relative to 1-bit SGD, because the full 32-bit gradients have to be exchanged between workers.
If your CNTK build supports NCCL, then using 32 bits shouldn't slow things down much. Keep in mind that 1-bit SGD itself also has a computation cost (for quantization).
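Putting the above together, here is a minimal sketch of data-parallel training without 1-bit SGD. The toy model, learning-rate values, and script structure are only illustrative, and it assumes a recent CNTK v2 Python API (in some versions the distributed module is exposed as cntk.distributed rather than cntk.train.distributed):

import numpy as np
import cntk as C
from cntk.train import distributed

# Toy model: logistic regression on 2-D inputs, just so the script is runnable.
features = C.input_variable(2)
labels = C.input_variable(2)
z = C.layers.Dense(2)(features)
loss = C.cross_entropy_with_softmax(z, labels)
metric = C.classification_error(z, labels)

# Base learner, exactly as in single-GPU training.
lr = C.learning_rate_schedule(0.01, C.UnitType.minibatch)
learner = C.momentum_sgd(z.parameters, lr, C.momentum_schedule(0.9))

# Data-parallel wrapper WITHOUT 1-bit SGD: num_quantization_bits stays at its
# default of 32, so gradients are aggregated unquantized (plain synchronous
# data-parallel SGD) and no warm start is needed.
distributed_learner = distributed.data_parallel_distributed_learner(
    learner = learner,
    num_quantization_bits = 32,
    distributed_after = 0)

trainer = C.Trainer(z, (loss, metric), [distributed_learner])

# Each worker would normally read its own shard of the data; random data here.
for _ in range(10):
    x = np.random.rand(32, 2).astype(np.float32)
    y = np.eye(2, dtype=np.float32)[np.random.randint(0, 2, size=32)]
    trainer.train_minibatch({features: x, labels: y})

# Always finalize the communicator at the end of a distributed script.
distributed.Communicator.finalize()

You launch it across workers the same way as with 1-bit SGD, e.g. mpiexec -n 4 python yourscript.py.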