
Deploying LLM on Sagemaker Endpoint - CUDA out of Memory


I am trying to deploy a Hugging Face LLM (for inference) to a SageMaker Endpoint using custom scripts (using the PyTorch framework, with the model and inference script packaged as a .tar.gz file). The tar.gz file structure is:

model.tar.gz/
|- pytorch_model.bin
|- ....
|- code/
  |- inference.py
  |- requirements.txt 

In inference.py, I have defined functions model_fn and predict_fn.
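
A simplified sketch of that structure is shown below (not my exact code; the request format and generation settings are just placeholders):

    # inference.py -- minimal sketch of the two handler functions
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def model_fn(model_dir):
        # Called once at container start-up; model_dir is where SageMaker
        # extracts model.tar.gz (/opt/ml/model)
        tokenizer = AutoTokenizer.from_pretrained(model_dir)
        model = AutoModelForCausalLM.from_pretrained(
            model_dir, torch_dtype=torch.bfloat16, device_map="auto"
        )
        model.eval()
        return model, tokenizer

    def predict_fn(data, model_and_tokenizer):
        # Called for each endpoint invocation with the deserialized request
        model, tokenizer = model_and_tokenizer
        inputs = tokenizer(data["inputs"], return_tensors="pt").to("cuda")
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=256)
        return {"generated_text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}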

This tar.gz file is uploaded to S3, and during deployment the model is picked up from that S3 location.

I have followed the process defined in https://huggingface.co/docs/sagemaker/en/inference --> sections "Create a model artifact for deployment" and "User defined code and modules".
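
The deployment call looks roughly like this (a simplified sketch using the SageMaker Python SDK; the bucket path, role ARN and container versions below are placeholders, not my actual values):

    # deploy.py -- rough sketch of deploying the S3 artifact to an endpoint
    from sagemaker.huggingface.model import HuggingFaceModel

    role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder role ARN

    huggingface_model = HuggingFaceModel(
        model_data="s3://my-bucket/path/model.tar.gz",  # the uploaded artifact
        role=role,
        transformers_version="4.26",   # assumed container versions
        pytorch_version="1.13",
        py_version="py39",
    )

    # code/inference.py inside the archive is picked up as the custom handler
    predictor = huggingface_model.deploy(
        initial_instance_count=1,
        instance_type="ml.g5.12xlarge",
    )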

After following all these steps, I am getting an error:

CUDA out of memory. Tried to allocate 20.00 MiB. GPU 1 has a total capacty of 22.20 GiB of which 13.12 MiB is free. Process 13234 has 2.25 GiB memory in use. Process 13238 has 3.82 GiB memory in use. Process 13236 has 8.06 GiB memory in use. Process 13239 has 8.06 GiB memory in use. Of the allocated memory 6.93 GiB is allocated by PyTorch, and 49.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF : 400

My model is an LLM with 7B parameters and the compute is ml.g5.12xlarge (192 GB RAM and 4 x 24 GB GPUs). The memory is more than sufficient (I moved to such a large instance precisely because I was getting this error), and the code uses AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained. I have tried device maps of "auto", "balanced_low_0", and "balanced". The GPU memory is sufficient to start with (as I checked from the memory summary).

The thing is, I was able to get a response for a couple of pings and then I started getting this error. I am clearing the cache in my predict function, but I still get the error.

How can I resolve my out-of-memory error? I either get it right at the start, or my GPU memory fills up incrementally with each inference.


Solution

  • I had tried a lot of different things mentioned in various places on the web, but none of them worked for me. The error was due to an inefficient GPU memory allocation strategy for the LLM (device_map="auto" was not working well for me) and some variables that were being kept on the GPU. I mention variables because the out-of-memory error appeared within the first four inferences, which means there was very little free GPU memory to start with; that is also why I say the GPU memory allocation strategy was not working for me.

    Before I elaborate on my answer, I will list the various suggestions I found in different places and then what finally worked for me. I believe most users who face this issue will benefit more from what I eventually did than from the items below, even during training.

    • Update max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF, e.g. os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:enter-size-here". For me this was the least helpful option. (A combined sketch of these allocator/cache tweaks follows this list.)

    • A user on a different forum mentioned that one needs to install the following packages: transformers==4.28.1, sentencepiece==0.1.97, accelerate==0.18.0, bitsandbytes==0.37.2 and torch 1.13.1. But I believe out-of-memory issues mostly need to be handled through memory management; package problems may crop up temporarily with new releases, but those get resolved with certainty.

    • Related to Training ONLY - while training vision models, the images might not all fit on the GPU, so you should adjust their size and also release them from GPU memory once used.

    • Related to Training ONLY - Reduce training batch sizes to as small as 1

    • Garbage collection gc.collect()

    • Empty cache torch.cuda.empty_cache()

    • Increase system RAM/larger compute instance
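
    As noted above, these allocator/cache tweaks can be combined in a few lines; the helper name and the 512 MiB split size below are only placeholders:

        import gc
        import os

        # Must be set before the first CUDA allocation to have any effect
        os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"

        import torch

        def free_gpu_memory():
            # Drop unreferenced Python objects, then release cached CUDA blocks
            gc.collect()
            torch.cuda.empty_cache()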

    What really helped me was distributing the LLM across the GPUs by defining the max_memory of the GPU that can be used for storing the model. This meant that my GPU was not fully booked by the LLM. It is a three-step process:

    • Load an empty model (no weights) on the GPU. During inference, use no_grad to avoid any gradient computation, even though no weights will be updated. Also, set the device map to cap the maximum memory the loaded model weights can take.

    • Load model weights on CPU

    • The weights for each layer are loaded to the GPU, the execution/calculation is done, and then the weights are removed from the GPU:

        # model_dir: path to the extracted model artifact (the argument of model_fn)
        import torch
        from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch
        from transformers import AutoModelForCausalLM

        with torch.no_grad():
            # Step 1: build the model skeleton without materializing the weights
            with init_empty_weights():
                old_prediction_model = AutoModelForCausalLM.from_pretrained(
                    model_dir,
                    torch_dtype=torch.bfloat16,
                    quantization_config=quantization_config,  # assumed defined elsewhere
                )
            # Steps 2-3: cap GPU 0 at 10GiB for weights; layers that do not fit are
            # offloaded to the offload folder and streamed in during the forward pass
            model = load_checkpoint_and_dispatch(
                old_prediction_model,
                checkpoint=model_dir,
                device_map=infer_auto_device_map(old_prediction_model, max_memory={0: "10GiB"}),
                offload_folder="/offload_folder_name_or_location",
                dtype=torch.bfloat16,
            )
    

    NOTE: Along with all this, another big cause of out-of-memory errors is leaving your variables on the GPU. Since execution happens on the GPU, if during that course you build up a list of model inferences or evaluations there, GPU memory will keep filling up as new inferences are made. To avoid this, after each inference move your variables off the GPU and into CPU memory.
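
    As a rough sketch of that hygiene inside a predict function (the function name, device choice and generation settings are illustrative, not my exact code):

        import gc
        import torch

        def predict(model, tokenizer, prompt):
            # Inputs go to the GPU that holds the first layers (cuda:0 in my setup)
            inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
            with torch.no_grad():
                output_ids = model.generate(**inputs, max_new_tokens=128)
            # Move the result off the GPU before keeping any reference to it,
            # then drop the GPU tensors and release the cached blocks
            text = tokenizer.decode(output_ids[0].cpu(), skip_special_tokens=True)
            del inputs, output_ids
            gc.collect()
            torch.cuda.empty_cache()
            return text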