I am replicating code from this page. I have downloaded the BERT model to my local system and am generating sentence embeddings with it.
I have around 500,000 sentences for which I need sentence embeddings, and it is taking a lot of time.
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

corpa = ["i am a boy", "i live in a city"]
storage = []  # list to store all embeddings

for text in corpa:
    # Add the special tokens.
    marked_text = "[CLS] " + text + " [SEP]"

    # Split the sentence into tokens.
    tokenized_text = tokenizer.tokenize(marked_text)

    # Map the token strings to their vocabulary indices.
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)

    segments_ids = [1] * len(tokenized_text)

    tokens_tensor = torch.tensor([indexed_tokens])
    segments_tensors = torch.tensor([segments_ids])

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]

    # `hidden_states` has shape [13 x 1 x <number of tokens> x 768]
    # `token_vecs` is a tensor with shape [<number of tokens> x 768]
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    storage.append((text, sentence_embedding))
Update 1
I modified my code based on the answer provided, but it is still not doing full batch processing.
#!pip install transformers
import torch
import transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states = True, # Whether the model returns all hidden-states.
                                  )

# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()

batch_sentences = ["Hello I'm a single sentence",
                   "And another sentence",
                   "And the very very last one"]
encoded_inputs = tokenizer(batch_sentences)

storage = []  # list to store all embeddings

for i, text in enumerate(encoded_inputs['input_ids']):
    tokens_tensor = torch.tensor([encoded_inputs['input_ids'][i]])
    segments_tensors = torch.tensor([encoded_inputs['attention_mask'][i]])
    print(tokens_tensor)
    print(segments_tensors)

    # Run the text through BERT, and collect all of the hidden states produced
    # from all 12 layers.
    with torch.no_grad():
        outputs = model(tokens_tensor, segments_tensors)

        # Evaluating the model will return a different number of objects based on
        # how it's configured in the `from_pretrained` call earlier. In this case,
        # because we set `output_hidden_states = True`, the third item will be the
        # hidden states from all layers. See the documentation for more details:
        # https://huggingface.co/transformers/model_doc/bert.html#bertmodel
        hidden_states = outputs[2]

    # `hidden_states` has shape [13 x 1 x <number of tokens> x 768]
    # `token_vecs` is a tensor with shape [<number of tokens> x 768]
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    print(sentence_embedding[:10])
    storage.append((text, sentence_embedding))
I could replace the first two lines of the for loop with the following, but that only works if all sentences have the same length after tokenization:
tokens_tensor = torch.tensor([encoded_inputs['input_ids']])
segments_tensors = torch.tensor([encoded_inputs['attention_mask']])
Moreover, in that case outputs = model(tokens_tensor, segments_tensors) fails.
How can I fully perform batch processing in such a case?
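I imagine it would need something like the following (padding every sentence in a batch to the same length and using the attention mask to ignore the padding when averaging), but I have not verified that this is correct:

# Untested sketch: pad the whole batch to the same length and mask out the
# padding when averaging token vectors. batch_size = 32 is an arbitrary choice.
batch_size = 32
storage = []

for start in range(0, len(batch_sentences), batch_size):
    batch = batch_sentences[start:start + batch_size]

    # padding=True pads every sentence to the longest one in the batch;
    # return_tensors='pt' gives rectangular PyTorch tensors directly.
    encoded = tokenizer(batch, padding=True, truncation=True, return_tensors='pt')

    with torch.no_grad():
        outputs = model(input_ids=encoded['input_ids'],
                        attention_mask=encoded['attention_mask'])
        hidden_states = outputs[2]

    # Second-to-last layer, shape [batch size x max length x 768]
    token_vecs = hidden_states[-2]

    # Zero out padded positions before averaging, then divide by the real lengths.
    mask = encoded['attention_mask'].unsqueeze(-1).float()
    summed = (token_vecs * mask).sum(dim=1)
    counts = mask.sum(dim=1)
    sentence_embeddings = summed / counts

    for sent, emb in zip(batch, sentence_embeddings):
        storage.append((sent, emb))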
About your original question: there is not much you can do. BERT is a pretty computationally demanding algorithm. Your best shot is to use BertTokenizerFast instead of the regular BertTokenizer. The "fast" version is much more efficient, and you will see the difference for large amounts of text.
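The swap is a one-line change; a minimal sketch (the rest of your code can stay the same):

from transformers import BertTokenizerFast

# Drop-in replacement for BertTokenizer: same API, backed by a Rust tokenizer.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

# Tokenizing many sentences in one call is also much faster than one at a time.
encoded_inputs = tokenizer(batch_sentences, padding=True, truncation=True,
                           return_tensors='pt')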
That said, I have to warn you that averaging BERT word embeddings does not create good embeddings for the sentence. See this post. From your question I assume you want to do some kind of semantic similarity search. Try using one of those open-sourced models.
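If the models in question are the Sentence-BERT family (an assumption on my part; they are distributed through the sentence-transformers package, and the checkpoint name below is only an illustrative choice), usage looks roughly like this:

# Assumed example: sentence-transformers and the 'all-MiniLM-L6-v2' checkpoint
# are illustrative, not the only options.
from sentence_transformers import SentenceTransformer

sentences = ["i am a boy", "i live in a city"]  # replace with your 500,000 sentences

model = SentenceTransformer('all-MiniLM-L6-v2')

# encode() batches internally, so the whole corpus can be processed in one call.
embeddings = model.encode(sentences,
                          batch_size=64,
                          show_progress_bar=True,
                          convert_to_numpy=True)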