
GPU running out of memory with a batch of just 1


I'm trying to use the "facebook/bart-large-mnli" HuggingFace model to run inference on some embeddings.

I first do, globally:

model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli"
)
model = model.to(self.device)

I have the following (in a more local scope, but in the same Python process):

for premise, hypothesis in list_input:
    tokenized_model_inputs = (
        model.encode(premise, hypothesis, return_tensors="pt", truncation=True)
        .to(self.device)
    )
    model(tokenized_model_inputs)

Then, at a certain moment, when tokenized_model_inputs.shape is torch.Size([1, 957]), I get the CUDA out of memory error.

CUDA out of memory. Tried to allocate 56.00 MiB (GPU 0; 3.81 GiB total capacity; 2.94 GiB already allocated; 23.44 MiB free; 3.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I estimated the size of the input: tokenized_model_inputs.element_size() * tokenized_model_inputs.nelement() gives 7656 (bytes?).

So I don't see how I'm trying to allocate that many megabytes... I suspect I may be accumulating data unnecessarily.

  • Am I right to send the model to the device at global scope instead of a more local scope, just before encoding?
  • Am I batching the inputs correctly?
  • What else can I do to solve this issue?

Thanks in advance.


Solution

  • First, there is a big difference between your code and the examples from HuggingFace: you should have a tokenizer (tokenizer = AutoTokenizer.from_pretrained('facebook/bart-large-mnli')) that encodes the input, and then a model that takes the encoded input and produces an output. Moreover, when I run the same thing as you, I get an AttributeError, which confirms that the model from HuggingFace does not have an encode method. Also, since you are not training the model, you should wrap all the code that uses the model in a with torch.no_grad(): block to avoid computing and storing gradients you will never use.

    The corrected code would look like this:

    model = AutoModelForSequenceClassification.from_pretrained(
        "facebook/bart-large-mnli"
    )
    model = model.to(self.device)
    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")

    with torch.no_grad():
        for premise, hypothesis in list_input:
            tokenized_model_inputs = tokenizer.encode(
                premise, hypothesis, return_tensors="pt", truncation=True
            ).to(self.device)
            output = model(tokenized_model_inputs)
            # Do what you want with output


    Then, there is no batching in the code you gave: since the first dimension of the tensor you observe is 1, each forward pass processes a single premise/hypothesis pair (see the sketch below).
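
    A minimal sketch of how the pairs could be batched, assuming the tokenizer and model from the corrected code above (batch_size is an illustrative value, not something from your code). Note that a larger batch means more memory per forward pass, so this helps throughput rather than the out-of-memory error itself:

    batch_size = 8  # illustrative value
    with torch.no_grad():
        for i in range(0, len(list_input), batch_size):
            batch = list_input[i:i + batch_size]
            premises = [p for p, _ in batch]
            hypotheses = [h for _, h in batch]
            # The tokenizer pads every sequence in the batch to the same length
            encoded = tokenizer(
                premises,
                hypotheses,
                return_tensors="pt",
                truncation=True,
                padding=True,
            ).to(self.device)
            outputs = model(**encoded)
            # Do what you want with outputs.logits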

    Finally, the memory issue you are facing comes mostly from the model itself: once loaded on the GPU, it takes up about 2.2 GiB on its own, and the forward pass needs additional memory for activations. The rest of your GPU usage probably comes from other variables.
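
    If you want to check that yourself, here is a quick sketch using the same element_size()/nelement() estimate as in your question, applied to the model parameters (weights only, excluding activations and allocator overhead):

    param_bytes = sum(p.nelement() * p.element_size() for p in model.parameters())
    print(f"model parameters: {param_bytes / 1024**3:.2f} GiB")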

    To fix it, you have a few options:

    • Use half-precision floats for your model to reduce GPU memory usage with model.half(), but be careful to also use tensor.half() for everything going to your model (a sketch is shown after this list).
    • Run on your CPU instead: for inference, transformer-based models can be used on CPU without being terribly slow. As before, be careful to put everything (model + tensors) on the CPU.
    • Use another GPU with more memory, so that the model, inputs, and activations all fit.
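
    A minimal sketch of the half-precision option, assuming a CUDA device and the tokenizer from the corrected code above:

    model = model.half().to(self.device)  # weights now stored as float16

    with torch.no_grad():
        for premise, hypothesis in list_input:
            tokenized_model_inputs = tokenizer.encode(
                premise, hypothesis, return_tensors="pt", truncation=True
            ).to(self.device)
            # input_ids are integer tensors, so they don't need .half();
            # only floating-point tensors fed to the model would.
            output = model(tokenized_model_inputs)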