Tags: python, machine-learning, pytorch, google-colaboratory

High GPU RAM Usage Training Large Language Model with Small Dataset on A100


I'm encountering an issue with excessive GPU RAM consumption while training a large language model on a relatively small dataset. Despite using only 200 rows of data, the training process consumes around 40 GB of RAM on an Nvidia A100 GPU, which seems disproportionately high for the dataset size.

Environment:

  1. Model: vilsonrodrigues/falcon-7b-instruct-sharded (a sharded variant of the 7-billion-parameter Falcon-7B-Instruct model)
  2. Dataset size: 200 rows
  3. GPU: Nvidia A100
  4. Framework: PyTorch on Google Colab
  5. Transformers library by Hugging Face (version unspecified)
  6. Training configuration (a sketch matching these settings follows this list):
    • Batch size: 1
    • Gradient accumulation steps: 16
    • Mixed precision (FP16) enabled
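For reference, here is a minimal sketch of a Hugging Face `TrainingArguments` setup matching the settings listed above; the output directory, epoch count, and logging interval are illustrative placeholders, not values taken from the notebook:

```python
from transformers import TrainingArguments

# Sketch of a configuration matching the settings listed above;
# the actual notebook may use different paths and step counts.
training_args = TrainingArguments(
    output_dir="falcon-7b-finetune",   # hypothetical output directory
    per_device_train_batch_size=1,     # batch size 1, as listed above
    gradient_accumulation_steps=16,    # effective batch size of 16
    fp16=True,                         # mixed-precision (FP16) training
    num_train_epochs=1,                # placeholder; not specified in the question
    logging_steps=10,
)
```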

Code: https://colab.research.google.com/drive/1TNra_fwJbQ9M3FsFB1z8sniOxLzkDJfo?usp=sharing

Issue: The training consumes around 40 GB of GPU RAM, which seems excessive for the small size of the dataset and the training configuration used. I've already employed strategies such as reducing the batch size, enabling mixed precision training, and using gradient accumulation to manage memory usage, but the issue persists.

OutOfMemoryError: CUDA out of memory. Tried to allocate 316.00 MiB. GPU 0 has a total capacty of 15.77 GiB of which 240.38 MiB is free. Process 36185 has 15.54 GiB memory in use. Of the allocated memory 15.19 GiB is allocated by PyTorch, and 43.11 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Questions:

  • Are there any recommended strategies to further reduce GPU RAM usage for training large language models on small datasets?
  • Could there be potential misconfigurations or inefficiencies in my training setup that I might be overlooking?
  • Is there a way to optimize the utilization of the A100 GPU for such training tasks to prevent excessive memory usage?

Solution

  • GPU RAM consumption depends mainly on the model weights, gradients, optimizer states, and the activations for a single batch, not on the dataset size.

    At each training iteration the model weights sit in GPU memory and the forward and backward passes are computed only for the examples in the current batch, so it makes no difference whether the dataset has 1 row or 1 million rows. You can try training on a single data point to confirm that memory usage doesn't change (it shouldn't).

    Some strategies you might try to reduce memory consumption are quantization and LoRA (see https://pytorch.org/blog/finetune-llms/); a sketch combining the two is shown below.
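As an illustration, here is a minimal sketch of loading the model in 4-bit precision with bitsandbytes and attaching LoRA adapters via the PEFT library. The quantization settings and LoRA hyperparameters (r, lora_alpha, target module names) are illustrative assumptions, not values from the question's notebook:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "vilsonrodrigues/falcon-7b-instruct-sharded"

# Load the base weights in 4-bit NF4, cutting weight memory roughly 4x
# compared to FP16; computation still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training and enable gradient checkpointing,
# which trades extra compute for a large reduction in activation memory.
model = prepare_model_for_kbit_training(model)
model.gradient_checkpointing_enable()

# Attach small trainable LoRA adapters; only these parameters receive
# gradients and optimizer states, so optimizer memory shrinks dramatically.
lora_config = LoraConfig(
    r=16,                                # illustrative rank
    lora_alpha=32,                       # illustrative scaling factor
    target_modules=["query_key_value"],  # assumed attention projection name for Falcon
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

With 4-bit weights, gradient checkpointing, and LoRA adapters, a 7B-parameter model typically fits within a 16 GB GPU for fine-tuning at batch size 1.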