
In Chainer, how do you write a BPTT updater that uses multiple GPUs?


I can't find an example because the existing example only extends training.StandardUpdater and therefore uses only one GPU.


Solution

  • I assume you are talking about the BPTTUpdater in Chainer's ptb example.

    It's not straightforward to make the customized updater support training on multiple GPUs. MultiprocessParallelUpdater hard-codes the way the gradient is computed (only the target link implementation is customizable), so you have to copy the whole implementation of MultiprocessParallelUpdater and modify the gradient computation parts. The file to copy and edit is chainer/training/updaters/multiprocess_parallel_updater.py.

    There are two parts of this file that compute gradients: one in _Worker.run, which is the task run by each worker process, and the other in MultiprocessParallelUpdater.update_core, which is the task run by the master process. You have to make both do BPTT by modifying the code from _calc_loss through backward in each of these two places:

    # In the _Worker.run version, change self._master to self.model
    loss = _calc_loss(self._master, batch)
    self._master.cleargrads()
    loss.backward()
    

    Each of these should be modified by inserting the BPTT logic from BPTTUpdater.update_core, as in the sketch below.
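
    As a rough sketch only: the replaced block could end up looking like the function below, which follows the truncated-BPTT recipe of the ptb example (accumulate the loss over bprop_len steps, backprop once, then cut the graph with unchain_backward). The names model, iterator, converter, device and bprop_len are placeholders for self._master / self.model, the process-local iterator, self.converter, the process device and the truncation length; they are assumptions for illustration, not names taken from multiprocess_parallel_updater.py.

    import chainer

    def compute_bptt_grads(model, iterator, converter, device, bprop_len):
        """Truncated-BPTT stand-in for the single _calc_loss/backward call."""
        loss = 0
        for _ in range(bprop_len):
            batch = iterator.next()          # pull the next time-step batch
            x, t = converter(batch, device)  # move arrays to this process's GPU
            # In the ptb example, the target link returns the softmax cross entropy loss.
            loss += model(chainer.Variable(x), chainer.Variable(t))

        model.cleargrads()
        loss.backward()
        loss.unchain_backward()  # truncate the graph so it does not grow without bound
        return loss

    The gradient aggregation across devices that surrounds this block in the copied file can stay as it is.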

    You also have to take care of the data iterators. MultiprocessParallelUpdater accepts a set of iterators that are distributed to the master/worker processes. Since the ptb example uses a customized iterator (ParallelSequentialIterator), you have to make sure that these iterators iterate over different portions of the dataset, or use different initial offsets of word positions. That may require customizing ParallelSequentialIterator as well; one possible setup is sketched below.
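
    A minimal sketch, assuming ParallelSequentialIterator can be imported from the example's train_ptb.py and that giving each device a disjoint slice of the training text is acceptable; the make_iterators helper and the splitting scheme are illustrative assumptions, not part of the ptb example.

    # Assumes the ptb example's train_ptb.py is importable from the working directory.
    from train_ptb import ParallelSequentialIterator

    def make_iterators(train_data, batch_size, n_devices):
        """One iterator per process, each over a disjoint slice of the dataset."""
        chunk = len(train_data) // n_devices
        return [
            ParallelSequentialIterator(
                train_data[i * chunk:(i + 1) * chunk], batch_size)
            for i in range(n_devices)
        ]

    # The resulting list is what gets passed to the modified updater,
    # e.g. (the class name here is hypothetical):
    # updater = BPTTParallelUpdater(iterators, optimizer, devices={'main': 0, 'second': 1})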