I'm trying to use the "facebook/bart-large-mnli" HuggingFace model to run inference on some embeddings.
I first do, globally:
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli"
)
model = model.to(self.device)
I have the following (in a more local scope, but in the same Python process):
for premise, hypothesis in list_input:
    tokenized_model_inputs = (
        model.encode(premise, hypothesis, return_tensors="pt", truncation=True)
        .to(self.device)
    )
    model(tokenized_model_inputs)
Then, at a certain moment, when tokenized_model_inputs.shape is torch.Size([1, 957]), I get a CUDA out of memory error:
CUDA out of memory. Tried to allocate 56.00 MiB (GPU 0; 3.81 GiB total capacity; 2.94 GiB already allocated; 23.44 MiB free; 3.00 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
I estimated the size of that tensor: tokenized_model_inputs.element_size() * tokenized_model_inputs.nelement() gives 7656 bytes.
So I don't see how I'm trying to allocate that many megabytes... I suspect I may be accumulating data unnecessarily.
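For reference, that figure is consistent with a [1, 957] tensor of int64 token IDs (8 bytes per element):

```python
import torch

# A [1, 957] tensor of token IDs, as a tokenizer produces (int64 by default)
ids = torch.zeros((1, 957), dtype=torch.long)

size_bytes = ids.element_size() * ids.nelement()
print(size_bytes)  # 1 * 957 * 8 = 7656 bytes
```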
Thanks in advance.
First, there is a big difference between your code and the example from HuggingFace: you should have a tokenizer (tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")) that encodes the input, and then a model that takes the encoded input and produces an output. Moreover, when running the same code as you, I get an AttributeError that confirms the model from HuggingFace does not have an encode method.
Also, since you are not training the model, you should wrap all the code that uses the model in a with torch.no_grad(): block to avoid computing and storing gradients that you will not use.
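A minimal sketch of the effect, with a toy linear layer standing in for the real model: under torch.no_grad(), outputs carry no autograd graph, so PyTorch does not keep intermediate activations around for a backward pass.

```python
import torch

lin = torch.nn.Linear(4, 4)  # toy stand-in for the real model
x = torch.randn(1, 4)

y_train = lin(x)              # autograd graph is recorded
with torch.no_grad():
    y_infer = lin(x)          # no graph, no stored activations

print(y_train.requires_grad)  # True
print(y_infer.requires_grad)  # False
```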
The corrected code would look like:
model = AutoModelForSequenceClassification.from_pretrained(
    "facebook/bart-large-mnli"
)
model = model.to(self.device)
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")

with torch.no_grad():
    for premise, hypothesis in list_input:
        tokenized_model_inputs = tokenizer.encode(
            premise, hypothesis, return_tensors="pt", truncation=True
        ).to(self.device)
        output = model(tokenized_model_inputs)
        # Do what you want with output
Then, there is no batching in the code you gave, and since the first dimension of the variable you observe is 1, each forward pass processes a single example.
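If you want to try batching, one way (a sketch, assuming list_input is a list of (premise, hypothesis) tuples and a hypothetical batch_size chosen to fit your GPU) is to chunk the pairs and tokenize each chunk at once:

```python
def batched(items, batch_size):
    """Yield successive chunks of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

list_input = [("p1", "h1"), ("p2", "h2"), ("p3", "h3")]  # dummy pairs
batches = list(batched(list_input, 2))
print(batches)  # [[('p1', 'h1'), ('p2', 'h2')], [('p3', 'h3')]]
```

Each chunk can then go through something like tokenizer([p for p, _ in batch], [h for _, h in batch], padding=True, truncation=True, return_tensors="pt"), since calling the tokenizer on lists pads them to a common length. Keep in mind that larger batches also use more GPU memory, which works against you here.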
Finally, the memory issue you are facing comes from the model itself being on the GPU, where it uses about 2.2 GiB of memory on its own. The rest of your GPU usage probably comes from other variables.
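You can check that kind of estimate yourself by summing the parameter sizes (a sketch with a toy layer; for bart-large-mnli the same sum runs over roughly 400M fp32 parameters):

```python
import torch

model = torch.nn.Linear(512, 512)  # toy stand-in for the real model
param_bytes = sum(p.element_size() * p.nelement() for p in model.parameters())
print(param_bytes)  # (512*512 + 512) weight+bias elements * 4 bytes = 1050624
```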
To fix it, you have a few options: for instance, convert the model to half precision with model.half(), but be careful to also use tensor.half() on everything going into your model.
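A sketch of the effect with a toy layer: half() halves the memory of every float parameter, and float inputs must be converted to match (integer token IDs, on the other hand, should stay torch.long):

```python
import torch

model = torch.nn.Linear(1024, 1024)  # toy stand-in for the real model
fp32 = sum(p.element_size() * p.nelement() for p in model.parameters())

model.half()  # convert all float parameters to float16 in place
fp16 = sum(p.element_size() * p.nelement() for p in model.parameters())
print(fp32 // fp16)  # 2: exactly half the parameter memory

# Float tensors fed to the model need the same conversion:
x = torch.randn(1, 1024).half()
print(x.dtype)  # torch.float16
# Note: integer inputs such as token IDs must remain torch.long.
```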