I have roughly 2 million sentences that I want to turn into vectors using Facebook AI's RoBERTa-large, fine-tuned on NLI and STSB for sentence similarity (via the awesome sentence-transformers package).
I already have a dataframe with two columns: "utterance", containing each sentence from the corpus, and "report", containing, for each sentence, the title of the document it comes from.
From there, my code is the following:
import pandas as pd
from sentence_transformers import SentenceTransformer
from tqdm import tqdm

model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')

print("Embedding sentences")
data = pd.read_csv("data/sentences.csv")
sentences = data['utterance'].tolist()

# encode one sentence at a time, collecting the vectors in a list
sentence_embeddings = []
for sent in tqdm(sentences):
    embedding = model.encode([sent])
    sentence_embeddings.append(embedding[0])

data['vector'] = sentence_embeddings
Right now, tqdm estimates that the whole process will take around 160 hours on my computer, which is more than I can spare.
Is there any way I could speed this up by changing my code? And is building a huge list in memory and then assigning it to a dataframe column the best way to proceed here? (I suspect not.)
Many thanks in advance!
I found a ridiculous speedup with this package by feeding in the utterances as a single list instead of looping over them one by one. I assume there is some nice internal batching/vectorisation going on.
%timeit utterances_enc = model.encode(utterances[:10])
3.07 s ± 53.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit utterances_enc = [model.encode(utt) for utt in utterances[:10]]
4min 1s ± 8.08 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
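The gain presumably comes from encode batching the list internally, so each forward pass handles many sentences at once. If you have a GPU, moving the model there and raising the batch size should help further. A minimal sketch of that (the device argument and the batch_size / show_progress_bar keyword arguments of encode exist in recent sentence-transformers versions; the batch size of 64 is just a value to tune for your memory):

import torch
from sentence_transformers import SentenceTransformer

# put the model on a GPU if one is available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens', device=device)

# larger batches usually encode faster on a GPU, until memory runs out
utterances_enc = model.encode(utterances[:10], batch_size=64, show_progress_bar=True)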
The full code would be as follows:
import pandas as pd
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens')

print("Embedding sentences")
data = pd.read_csv("data/sentences.csv")
sentences = data['utterance'].tolist()

# encode() batches the whole list internally and returns one embedding per sentence
sentence_embeddings = model.encode(sentences)

# wrap in list() so pandas stores one vector per row (encode may return a 2-D array)
data['vector'] = list(sentence_embeddings)
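On the second question: keeping 2 million vectors in a dataframe column works, but a flat NumPy array saved to disk is usually lighter to store and reload. A small sketch of that alternative, with file names that are just placeholders of my own:

import numpy as np

# one (n_sentences, dim) float32 array; row i matches row i of the dataframe
embeddings = np.asarray(sentence_embeddings, dtype=np.float32)
np.save("data/embeddings.npy", embeddings)

# keep only the text and report title in the CSV; the vectors live in the .npy file
data[['utterance', 'report']].to_csv("data/sentences_indexed.csv", index=False)

# later, reload without re-encoding
embeddings = np.load("data/embeddings.npy")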