memory-management, pytorch, huggingface-transformers, large-language-model

PyTorch CUDA allocated memory is growing into hundreds of GB


I am trying to run inference from a HuggingFace Transformers model using the PyTorch framework. I have a GPU instance running, and when I check the CUDA memory summary I find that the allocated memory (Total Allocation) increases by hundreds of GB with each inference, e.g. after the 2nd inference Total Allocation was 19 GB, and after the 3rd inference it was 205 GB. This total allocation is then shown as freed. The memory maps don't show any anomalous pattern, and current usage and peak usage are nearly constant. My SageMaker inference instance has only 128 GB of CPU memory and 24 GB of GPU memory.

[screenshot of the CUDA memory summary output]
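For context, here is a minimal sketch of how such a summary can be printed after each inference. The checkpoint name (`gpt2`) and the prompts are placeholders, not the actual model being served:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint -- substitute whichever model you are serving.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to("cuda").eval()

prompts = ["first request", "second request", "third request"]

for i, prompt in enumerate(prompts, start=1):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        model.generate(**inputs, max_new_tokens=32)
    # memory_summary() prints the table shown above:
    # Cur Usage / Peak Usage / Tot Alloc / Tot Freed per metric.
    print(f"--- after inference {i} ---")
    print(torch.cuda.memory_summary(abbreviated=True))
```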

So, I have three queries/concerns:

  1. How is it possible that the total allocated memory is larger than the SageMaker instance on which inference is running?
  2. How do I control this anomalous behaviour?
  3. Is this a concern that I need to rectify, given that the inference seems to be running fine?

Solution

  • Tot Alloc and Tot Freed show the total amount of memory allocated or freed over the lifetime of the memory snapshot. They are accumulated (cumulative) stats, which is why they keep growing with every inference.

    Cur Usage shows how much memory is currently being used by your process. Peak Usage shows the highest amount of memory in use at any single point in time.

    Peak Usage is the main value you should be concerned about: you will get a CUDA out-of-memory error if this value exceeds your GPU's memory.

    Additionally, the CUDA memory profiler profiles CUDA memory, not system memory. The values shown have nothing to do with system/CPU memory.
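If you prefer to read these counters programmatically instead of from the summary table, `torch.cuda.memory_stats()` exposes them directly. A minimal sketch, assuming device index 0:

```python
import torch

stats = torch.cuda.memory_stats(device=0)

# Cur Usage: bytes currently held by tensors in this process.
current = stats["allocated_bytes.all.current"]
# Peak Usage: high-water mark since the last reset -- the value that
# must stay below the GPU's capacity to avoid an out-of-memory error.
peak = stats["allocated_bytes.all.peak"]
# Tot Alloc / Tot Freed: cumulative bytes allocated and freed over time;
# these only ever grow, which is why they can exceed physical memory.
total_alloc = stats["allocated_bytes.all.allocated"]
total_freed = stats["allocated_bytes.all.freed"]

print(f"current={current / 2**30:.2f} GiB, peak={peak / 2**30:.2f} GiB")
print(f"total allocated={total_alloc / 2**30:.2f} GiB, "
      f"total freed={total_freed / 2**30:.2f} GiB")

# The accumulated and peak counters can be reset between requests if desired:
torch.cuda.reset_accumulated_memory_stats(device=0)
torch.cuda.reset_peak_memory_stats(device=0)
```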