Tags: python, gpu, amazon-sagemaker, huggingface-transformers

CUDA: RuntimeError: CUDA out of memory - BERT SageMaker


I have been trying to train a BertForSequenceClassification model on AWS SageMaker using the Hugging Face estimators, but I keep getting this error:

RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.17 GiB total capacity; 10.73 GiB already allocated; 87.88 MiB free; 10.77 GiB reserved in total by PyTorch)

The same code runs fine on my laptop.

  1. How do I check what is occupying that 10 GB of memory? My dataset is small (68 KB), as are my batch size (8) and number of epochs (1). When I run nvidia-smi, I only see "No processes running" and the GPU memory usage is zero. When I run print(torch.cuda.memory_summary(device=None, abbreviated=False)) from within my training script (right before it throws the error), it prints:
|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Active memory         |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| GPU reserved memory   |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |       0 B  |       0 B  |       0 B  |       0 B  |
|       from large pool |       0 B  |       0 B  |       0 B  |       0 B  |
|       from small pool |       0 B  |       0 B  |       0 B  |       0 B  |
|---------------------------------------------------------------------------|
| Allocations           |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Active allocs         |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| GPU reserved segments |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       0    |       0    |       0    |       0    |
|       from large pool |       0    |       0    |       0    |       0    |
|       from small pool |       0    |       0    |       0    |       0    |
|===========================================================================|

but I have no idea what this output means or how to interpret it.

  2. When I run !df -h I see:
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         30G   72K   30G   1% /dev
tmpfs            30G     0   30G   0% /dev/shm
/dev/xvda1      109G   93G   16G  86% /
/dev/xvdf       196G   61M  186G   1% /home/ec2-user/SageMaker

How is this storage different from the GPU memory? If there is ~200 GB on /dev/xvdf, is there any way I can just use that? In my test script I tried

model = BertForSequenceClassification.from_pretrained(args.model_name, num_labels=args.num_labels).to("cpu")

but that just gives the same error.


Solution

  • A CUDA out-of-memory error means that your GPU's memory (VRAM) is full. This is different from the disk storage on your instance, which is what the df -h command reports.
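
    As a quick sanity check you can print both from inside the training script. A minimal sketch (the torch.cuda numbers reflect only PyTorch's own allocator; shutil.disk_usage("/") is the same information df -h gives you for the root filesystem):

import shutil
import torch

# GPU memory -- what the CUDA OOM error is about
props = torch.cuda.get_device_properties(0)
print(f"GPU total:     {props.total_memory / 1024**3:.2f} GiB")
print(f"GPU allocated: {torch.cuda.memory_allocated(0) / 1024**3:.2f} GiB")
print(f"GPU reserved:  {torch.cuda.memory_reserved(0) / 1024**3:.2f} GiB")

# Disk storage -- what df -h reports
total, used, free = shutil.disk_usage("/")
print(f"Disk total: {total / 1024**3:.0f} GiB, free: {free / 1024**3:.0f} GiB")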

    This memory is occupied by the model you load onto the GPU, which is independent of your dataset size. Training needs at least twice the size of the model itself, and more realistically around 4 times it, because the weights, gradients, and optimizer states (plus activations and checkpoints) all have to fit in GPU memory at once.
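
    For a rough sense of the scale, here is a back-of-the-envelope sketch (assuming bert-base-uncased, fp32 weights, and a plain Adam optimizer; activation memory comes on top of this and grows with batch size and sequence length):

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
n_params = sum(p.numel() for p in model.parameters())   # ~110M for bert-base

weights   = n_params * 4    # fp32 weights, 4 bytes each
gradients = n_params * 4    # one fp32 gradient per weight
adam      = n_params * 8    # Adam keeps two extra fp32 buffers per weight
print(f"{n_params / 1e6:.0f}M parameters -> "
      f"~{(weights + gradients + adam) / 1e9:.1f} GB of training state before activations")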

    Things you can try (the sketch after this list shows where these settings live in the estimator):

    • Provision an instance with more GPU memory
    • Decrease batch size
    • Use a different (smaller) model
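
    With the SageMaker Hugging Face estimator, those settings map to the instance type and the hyperparameters passed to your training script. This is a sketch only: the hyperparameter names must match whatever your train.py actually parses, and the framework versions are just one example of a supported combination.

import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()

estimator = HuggingFace(
    entry_point="train.py",            # your existing training script
    role=role,
    instance_type="ml.p3.2xlarge",     # 16 GiB V100 instead of an 11 GiB GPU
    instance_count=1,
    transformers_version="4.26.0",
    pytorch_version="1.13.1",
    py_version="py39",
    hyperparameters={
        "model_name": "distilbert-base-uncased",  # smaller model
        "num_labels": 2,
        "batch_size": 4,                          # smaller batch size
        "epochs": 1,
    },
)
estimator.fit({"train": "s3://your-bucket/path/to/train.csv"})   # placeholder S3 input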