I am trying to run inference on a HuggingFace Transformers model using the PyTorch framework. I have a GPU instance running, and when I check the CUDA memory summary I find that allocated memory (Total Allocation) increases by hundreds of GB with each inference — e.g. after the 2nd inference Total Allocation was 19 GB, and after the 3rd inference it was 205 GB. This total allocation is freed up afterwards. The memory maps don't show any anomalous pattern, and current usage and peak usage are nearly constant. My SageMaker inference instance has only 128 GB of CPU memory and 24 GB of GPU memory.
So, I have three queries/concerns:
`Tot Alloc` and `Tot Freed` show the total amount of memory allocated or freed over the lifetime of the memory snapshot. They are accumulated statistics, so they grow monotonically with every allocation your process makes — which is why they can reach hundreds of GB even though far less memory is ever in use at once.
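To see why `Tot Alloc` can dwarf actual usage, here is a toy illustration (plain Python, not PyTorch — the class and attribute names are hypothetical, chosen to mirror the memory-summary columns) of accumulated versus instantaneous counters:

```python
class ToyAllocator:
    """Toy stand-in for an allocator's bookkeeping (hypothetical names)."""

    def __init__(self):
        self.tot_alloc = 0   # accumulated: every allocation ever made
        self.tot_freed = 0   # accumulated: every free ever made
        self.cur_usage = 0   # instantaneous: bytes live right now
        self.peak_usage = 0  # high-water mark of cur_usage

    def alloc(self, n):
        self.tot_alloc += n
        self.cur_usage += n
        self.peak_usage = max(self.peak_usage, self.cur_usage)

    def free(self, n):
        self.tot_freed += n
        self.cur_usage -= n


a = ToyAllocator()
for _ in range(3):   # three "inference" passes
    a.alloc(10)      # temporary activations allocated during the pass
    a.free(10)       # freed once the pass finishes

# tot_alloc has grown to 30, yet cur_usage is back to 0 and peak_usage is only 10
```

This mirrors the behavior in the question: `Tot Alloc` jumping from 19 GB to 205 GB between inferences, while current and peak usage stay nearly constant.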
`Cur Usage` shows how much memory your process is currently using. `Peak Usage` shows the highest amount of memory in use at any single point in time.
`Peak Usage` is the main value you should be concerned about: you will get a CUDA out-of-memory error if it tries to exceed your GPU's memory (24 GB in your case).
Additionally, the CUDA memory profiler profiles CUDA (GPU) memory only, not system memory. The values shown have nothing to do with your instance's 128 GB of system/CPU memory.
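If you want to track these numbers programmatically rather than parsing `torch.cuda.memory_summary()`, PyTorch exposes the counters directly. A minimal sketch (the helper name is my own; it assumes a recent PyTorch and, for real readings, an available CUDA device):

```python
import torch

def gpu_mem_stats(device=0):
    """Return (current, peak) allocated CUDA memory in bytes,
    or None when no GPU is available (hypothetical helper)."""
    if not torch.cuda.is_available():
        return None
    cur = torch.cuda.memory_allocated(device)       # corresponds to Cur Usage
    peak = torch.cuda.max_memory_allocated(device)  # corresponds to Peak Usage
    return cur, peak

# To measure each inference pass in isolation, reset the peak counter
# between passes:
#   torch.cuda.reset_peak_memory_stats(device)
```

Calling `gpu_mem_stats()` before and after an inference pass lets you confirm that current usage returns to its baseline even while the accumulated totals keep growing.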