To compare different paragraphs, I am trying to use a transformer model: I feed each paragraph through the model, and at the end I intend to compare the outputs to see which paragraphs are most similar.
For this purpose, I am using the roberta-base model. I first run the RoBERTa tokenizer on a paragraph, then run the RoBERTa model on the tokenized output. But the process fails due to lack of memory: even 25 GB of RAM is not enough to complete it for paragraphs with 1324 lines.
Any idea how I can make this work, or any suggestions about what mistakes I might be making?
from transformers import RobertaTokenizer, RobertaModel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base").to(device)
inputs = tokenizer(dict_anrika['Anrika'], return_tensors="pt", truncation=True,
                   padding=True).to(device)
outputs = model(**inputs)
Sounds like you gave the model an input of shape [1324, longest_length_in_batch], which is huge. I tried a [1000, 512] input and found that even a server with 200 GB of RAM hits OOM.
One solution is to break the huge input into smaller batches, for example 10 lines at a time, and run the model on each batch separately.
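A minimal sketch of that batching approach (assuming dict_anrika['Anrika'] is a list of strings, as in the question; the mean pooling at the end is just one common way to reduce each line to a single vector for comparison):

import torch
from transformers import RobertaTokenizer, RobertaModel

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base").to(device)
model.eval()

lines = dict_anrika['Anrika']  # assumed: a list of 1324 strings
batch_size = 10

all_embeddings = []
with torch.no_grad():  # inference only, so skip gradient tracking to save memory
    for i in range(0, len(lines), batch_size):
        batch = lines[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors="pt", truncation=True,
                           padding=True).to(device)
        outputs = model(**inputs)
        # mean-pool the token embeddings into one vector per line, ignoring padding
        mask = inputs["attention_mask"].unsqueeze(-1)
        pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
        all_embeddings.append(pooled.cpu())  # move results off the device so they don't pile up there

embeddings = torch.cat(all_embeddings)  # shape [1324, 768]

Note the torch.no_grad() context: since you only compare outputs and never backpropagate, there is no need to keep the activations required for gradients, which are a large part of the memory footprint. Moving each batch's pooled output to the CPU also keeps device memory bounded by the batch size rather than by the full dataset.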