
CUDA memory error while running Haystack prompt node with GPU


I am getting a CUDA out-of-memory error while running this code:

from haystack.nodes import PromptNode

# lfqa_prompt is a PromptTemplate defined earlier in my pipeline
prompt_node = PromptNode(model_name_or_path='google/flan-t5-xl',
                         default_prompt_template=lfqa_prompt,
                         use_gpu=True,
                         max_length=300)

I tried to resolve the CUDA issue. The retriever works fine on the GPU; the error only occurs when I use the prompt node with the GPU. Any suggestions on how to fix it?

The error is:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 14.85 GiB total capacity; 4.02 GiB already allocated; 17.44 MiB free; 4.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.


Solution

  • For the model you are using, 'google/flan-t5-xl', there are smaller alternatives, such as 'google/flan-t5-small' or 'google/flan-t5-base'. They require much less memory, and switching to one of them would be my first suggestion here (see the first sketch below).

    Quantization would be a different approach. Haystack doesn't support quantization out of the box yet, but I believe it wouldn't be too difficult to add, so maybe you can make a feature request through a GitHub issue? In the meantime you could quantize the model outside Haystack (see the second sketch below).

    In the particular error message you posted, it seems that not all of the GPU memory is being used: for some reason PyTorch appears limited to about 4 GiB of the 14.85 GiB total capacity. It could well be that this is not related to the model but to a bug in torch or in the execution environment. Have you tried running it in a fresh environment? You might also want to check whether your problem is similar to one of the following torch issues: https://github.com/pytorch/pytorch/issues/40002 or https://github.com/pytorch/pytorch/issues/67680. The error message's own hint about max_split_size_mb is worth trying too (third sketch below).
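
A minimal sketch of the first suggestion, keeping all of your arguments and only swapping in a smaller checkpoint ('google/flan-t5-base' here; 'google/flan-t5-small' is smaller still):

from haystack.nodes import PromptNode

# flan-t5-base has roughly 250M parameters vs. roughly 3B for flan-t5-xl,
# so it needs only a fraction of the GPU memory
prompt_node = PromptNode(model_name_or_path='google/flan-t5-base',
                         default_prompt_template=lfqa_prompt,
                         use_gpu=True,
                         max_length=300)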
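
And a rough sketch of what 8-bit quantization could look like outside Haystack, using transformers with the bitsandbytes package installed (the prompt string is just a placeholder):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-xl')
# load_in_8bit=True quantizes the weights to int8 via bitsandbytes,
# cutting the memory footprint roughly in half compared to fp16
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-xl',
                                              load_in_8bit=True,
                                              device_map='auto')

inputs = tokenizer("Answer the following question: ...",
                   return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_length=300)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))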
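
Finally, if fragmentation turns out to be the cause, the error message's own suggestion can be applied from Python, as long as the variable is set before the first CUDA allocation (the value 128 is just a starting point to experiment with):

import os

# must run before torch allocates anything on the GPU
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:128'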