Assume I have 4 different datasets and 4 GPUs, like below.

4 datasets:

dat0 = [np.array(...)], dat1 = [np.array(...)], dat2 = [np.array(...)], dat3 = [np.array(...)]

4 GPUs:

devices = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]

Assume all four datasets have already been converted to tensors and transferred to the 4 different GPUs.

Now I have a function f from another module which can run on a GPU.

How can I do the following at the same time: compute the 4 results

ans0 = f(dat0) on devices[0], ans1 = f(dat1) on devices[1], ans2 = f(dat2) on devices[2], ans3 = f(dat3) on devices[3]

then move all 4 results back to the CPU and compute the sum

ans = ans0 + ans1 + ans2 + ans3
Assuming you only need ans for inference, you can easily perform those operations, but you will need a copy of f on all four GPUs at the same time.

Here is what I would try: replicate f once per GPU, launch the four forward passes, then send each result back to the CPU for the final sum:
import copy

# nn.Module has no .clone(); deepcopy the module, then move each copy to its GPU
fns = [copy.deepcopy(f).to(device) for device in devices]

# launch all four forward passes first -- CUDA kernels are queued
# asynchronously, so the four GPUs can work in parallel
outs = [fn(data) for fn, data in zip(fns, datasets)]

# .cpu() synchronizes each device and copies its result back to host memory
results = [out.detach().cpu() for out in outs]

ans = torch.stack(results).sum(dim=0)
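For reference, here is a self-contained sketch of the same idea. The stand-in for f (a `torch.nn.Linear`) and the toy data shapes are assumptions for illustration only; the sketch also falls back to the CPU so it runs on a machine without CUDA:

```python
import copy
import torch

torch.manual_seed(0)

# Hypothetical stand-in for the external function f; any nn.Module works here.
f = torch.nn.Linear(8, 8)

# Use every visible GPU, falling back to CPU so the sketch runs anywhere.
devices = [torch.device(f'cuda:{i}') for i in range(torch.cuda.device_count())]
devices = devices or [torch.device('cpu')]

# Four toy datasets, assigned round-robin if there are fewer devices than datasets.
datasets = [torch.randn(4, 8).to(devices[i % len(devices)]) for i in range(4)]

# One replica of f per device, made with deepcopy (nn.Module has no .clone()).
replicas = {d: copy.deepcopy(f).to(d) for d in devices}

with torch.no_grad():
    # Launch all forward passes first, then copy back: CUDA kernels are queued
    # asynchronously, and .cpu() is what forces each device to synchronize.
    outs = [replicas[x.device](x) for x in datasets]
    results = [out.cpu() for out in outs]

ans = torch.stack(results).sum(dim=0)
print(ans.shape)  # torch.Size([4, 8])
```

If f holds no parameters or buffers (a plain function rather than a module), the deepcopy step is unnecessary; you can call it directly on each device's tensors.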