Tags: python, pytorch, huggingface, llama

Running Llama 2 on a GeForce 1080 8 GB machine


I am trying to run Llama 2 on my server, which has the NVIDIA card mentioned in the title. It's a simple hello-world case you can find here. However, I constantly run into memory issues:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 250.00 MiB (GPU 0; 7.92 GiB total capacity; 7.12 GiB already allocated; 241.62 MiB free; 7.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I tried

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

but I get the same error. Is there anything else I can do?


Solution

  • According to this source:

    The model you use will vary depending on your hardware. For good results, you should have at least 10GB VRAM at a minimum for the 7B model, though you can sometimes see success with 8GB VRAM.

    To lower the memory footprint of the model, first try running it in half precision (if supported) with a batch size of one. If you still run out of CUDA memory, try a quantized version of the model, for example these ones. A sketch of both approaches follows below.
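
As a rough sketch (assuming the 7B chat checkpoint "meta-llama/Llama-2-7b-chat-hf" and the transformers/accelerate stack; adjust the model id and prompt to your own setup), half precision with a batch size of one looks like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint; requires access approval on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision: roughly 13-14 GB of weights for the 7B model
    device_map="auto",          # with accelerate installed, layers that don't fit are offloaded to CPU
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Even in half precision the 7B weights alone exceed 8 GB, so device_map="auto" will offload part of the model to the CPU, which runs but is slow. To fit the whole model on the GPU, a quantized load is the usual route; one possible 4-bit variant via bitsandbytes (assuming your GPU and bitsandbytes build support it) is:

from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,  # compute in fp16 while weights stay 4-bit
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)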