I am looking at the documentation here: https://github.com/Microsoft/CNTK/wiki/Multiple-GPUs-and-machines
According to the text: "Data-Parallel SGD may be used with or without 1bit-SGD."
However, further down in the document the only data-parallel section is the one that uses 1-bit SGD, "Data-Parallel Training with 1-bit SGD", with the following code:
distributed_learner = distributed.data_parallel_distributed_learner(
    learner = learner,
    num_quantization_bits = 1,
    distributed_after = distributed_after)  # warm start: don't use 1-bit SGD for first epoch
If I choose not to use 1-bit SGD (skip the relevant parameters in the above call), I would think that I should still get the parallelization benefits of data_parallel_distributed_learner. Could you confirm that this is the case?
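To be concrete, I mean calling it like this, simply leaving out the 1-bit-specific arguments and relying on their defaults:

distributed_learner = distributed.data_parallel_distributed_learner(learner = learner)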
Thank you
You can set num_quantization_bits to 32, and you will get plain synchronous data-parallel learning without quantization. Be warned, though, that depending on your network, setting num_quantization_bits to 32 may slow down your training relative to 1-bit SGD, because the full 32-bit gradients have to be exchanged between workers.
If your CNTK build supports NCCL, then using 32 bits shouldn't slow things down much. Keep in mind that 1-bit SGD itself also has a computation cost (for quantization).
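Putting the above together, here is a minimal sketch of data-parallel training without 1-bit SGD. The toy model, learning-rate values, and script structure are only illustrative, and it assumes a recent CNTK v2 Python API (in some versions the distributed module is exposed as cntk.distributed rather than cntk.train.distributed):

import numpy as np
import cntk as C
from cntk.train import distributed

# Toy model: logistic regression on 2-D inputs, just so the script is runnable.
features = C.input_variable(2)
labels = C.input_variable(2)
z = C.layers.Dense(2)(features)
loss = C.cross_entropy_with_softmax(z, labels)
metric = C.classification_error(z, labels)

# Base learner, exactly as in single-GPU training.
lr = C.learning_rate_schedule(0.01, C.UnitType.minibatch)
learner = C.momentum_sgd(z.parameters, lr, C.momentum_schedule(0.9))

# Data-parallel wrapper WITHOUT 1-bit SGD: num_quantization_bits stays at its
# default of 32, so gradients are aggregated unquantized (plain synchronous
# data-parallel SGD) and no warm start is needed.
distributed_learner = distributed.data_parallel_distributed_learner(
    learner = learner,
    num_quantization_bits = 32,
    distributed_after = 0)

trainer = C.Trainer(z, (loss, metric), [distributed_learner])

# Each worker would normally read its own shard of the data; random data here.
for _ in range(10):
    x = np.random.rand(32, 2).astype(np.float32)
    y = np.eye(2, dtype=np.float32)[np.random.randint(0, 2, size=32)]
    trainer.train_minibatch({features: x, labels: y})

# Always finalize the communicator at the end of a distributed script.
distributed.Communicator.finalize()

You launch it across workers the same way as with 1-bit SGD, e.g. mpiexec -n 4 python yourscript.py.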