pytorch, cuda, gpu, nvidia, gpgpu

Memory issue when running multiple processes on a GPU


This question is related to my other question.

I am running multiple machine learning processes in parallel (launched from bash). They are written in PyTorch. After a certain number of concurrent processes (10 in my case), I get the following error:

RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

As mentioned in this answer,

...it could occur because the VRAM memory limit was hit (which is rather non-intuitive from the error message).

For my case with PyTorch model training, decreasing batch size helped. You could try this or maybe decrease your model size to consume less VRAM.
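To make the batch-size suggestion concrete, it usually just means passing a smaller batch_size to the DataLoader. A minimal sketch, with a dummy dataset standing in for the real one:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset: 1024 fake image-sized samples with random labels.
dataset = TensorDataset(torch.randn(1024, 3, 224, 224),
                        torch.randint(0, 10, (1024,)))

# Halving the batch size roughly halves the activation memory per step.
train_loader = DataLoader(dataset, batch_size=16, shuffle=True)  # e.g. down from 32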

I tried the solution mentioned here to enforce a per-process GPU memory limit, but the issue persists.
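For reference, that limit can be set along these lines (a minimal sketch; the 0.1 fraction is only an example value, and it assumes a PyTorch version that provides torch.cuda.set_per_process_memory_fraction):

import torch

# Cap what this process may allocate through PyTorch's caching allocator
# to 10% of the device's total memory (example value). Exceeding the cap
# raises an out-of-memory error instead of taking more of the GPU.
torch.cuda.set_per_process_memory_fraction(0.1, device=0)

# ... build the model and train as usual ...

(This only limits PyTorch's caching allocator; the CUDA context each process creates still takes its own fixed chunk of device memory outside the cap.)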

This problem does not occur with a single process, or with a smaller number of processes. Since only one context runs at any given instant, why does this cause a memory issue?

The issue occurs both with and without MPS. I expected it to happen with MPS but not otherwise, since MPS can run multiple processes in parallel.


Solution

  • Since only one context runs at any given instant, why does this cause a memory issue?

    Context-switching doesn't dump the contents of GPU "device" memory (i.e. DRAM) to some other location. If you run out of this device memory, context switching doesn't alleviate that.

    If you run multiple processes, the memory used by each process will add up (just like it does in the CPU space) and GPU context switching (or MPS or time-slicing) does not alleviate that in any way.

    It's completely expected that if you run enough processes using the GPU, you will eventually run out of resources. Neither GPU context switching, nor MPS, nor time-slicing reduces the memory utilization of each process (the small probe below illustrates this).
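
    A quick way to see the additive effect is to run several copies of a small probe script concurrently and watch the free device memory each one reports. A sketch, assuming a PyTorch version that provides torch.cuda.mem_get_info (the ~1 GiB allocation size is arbitrary):

    import os
    import time
    import torch

    def main():
        device = torch.device("cuda:0")
        free_before, total = torch.cuda.mem_get_info(device)

        # Each process carves its CUDA context plus its own allocations out of
        # the same physical device memory; a context switch does not swap any
        # of it out.
        buf = torch.empty(1024 ** 3 // 4, dtype=torch.float32, device=device)  # ~1 GiB

        free_after, _ = torch.cuda.mem_get_info(device)
        print(f"pid={os.getpid()}: total={total / 2**30:.1f} GiB, "
              f"free before={free_before / 2**30:.1f} GiB, "
              f"free after={free_after / 2**30:.1f} GiB")

        time.sleep(30)  # hold the allocation so concurrent copies overlap

    if __name__ == "__main__":
        main()

    Launching a few copies at roughly the same time (e.g. backgrounding them from bash) shows each process's context plus its allocation subtracting from the same pool, until one of them can no longer get what it needs.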