I have been trying to train a BertSequenceForClassification Model using AWS Sagemaker. i'm using hugging face estimators. but I keep getting the error: RuntimeError: CUDA out of memory. Tried to allocate 192.00 MiB (GPU 0; 11.17 GiB total capacity; 10.73 GiB already allocated; 87.88 MiB free; 10.77 GiB reserved in total by PyTorch)
the same code runs fine on my laptop.
print(torch.cuda.memory_summary(device=None, abbreviated=False))
from within my training script (right before it throws the error) it prints|===========================================================================|
| PyTorch CUDA memory summary, device ID 0 |
|---------------------------------------------------------------------------|
| CUDA OOMs: 0 | cudaMalloc retries: 0 |
|===========================================================================|
| Metric | Cur Usage | Peak Usage | Tot Alloc | Tot Freed |
|---------------------------------------------------------------------------|
| Allocated memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Active memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| GPU reserved memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Non-releasable memory | 0 B | 0 B | 0 B | 0 B |
| from large pool | 0 B | 0 B | 0 B | 0 B |
| from small pool | 0 B | 0 B | 0 B | 0 B |
|---------------------------------------------------------------------------|
| Allocations | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Active allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| GPU reserved segments | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|---------------------------------------------------------------------------|
| Non-releasable allocs | 0 | 0 | 0 | 0 |
| from large pool | 0 | 0 | 0 | 0 |
| from small pool | 0 | 0 | 0 | 0 |
|===========================================================================|
but I have no idea what it means or how to interpret it
!df -h
I can see:Filesystem Size Used Avail Use% Mounted on
devtmpfs 30G 72K 30G 1% /dev
tmpfs 30G 0 30G 0% /dev/shm
/dev/xvda1 109G 93G 16G 86% /
/dev/xvdf 196G 61M 186G 1% /home/ec2-user/SageMaker
how is this memory different from the GPU? if theres 200GB in /dev/xvdf is there anyway I can just use that..? in my test script I tried
model = BertForSequenceClassification.from_pretrained(args.model_name,num_labels=args.num_labels).to("cpu")
but that just gives the same error
A CUDA out of memory
error indicates that your GPU RAM (Random access memory) is full. This is different from the storage on your device (which is the info you get following the df -h
command).
This memory is occupied by the model that you load into GPU memory, which is independent of your dataset size. The GPU memory required by the model is at least twice the actual size of the model, but most likely closer to 4 times (initial weights, checkpoint, gradients, optimizer states, etc).
Things you can try: