
BERT: Unable to reproduce sentence-to-embedding operation


I am trying to convert a sentence to an embedding with the following code.

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

model = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = "[CLS] This is a sentence. [SEP]"
tokens = tokenizer.tokenize(text)
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
encoded_layers, pooled_output = model(input_ids, output_all_encoded_layers=False)

The code runs, but it gives a different result every time: encoded_layers and pooled_output change on each run, even for the same input.

Thank you for your help!


Solution

  • Dropout is most likely still active during inference, which makes the outputs nondeterministic. Call model.eval() to switch the model to evaluation mode and disable dropout.

    In addition, transformers is the long-term-supported successor to pytorch_pretrained_bert, so stop using the latter:

    import torch
    from transformers import BertTokenizerFast, BertModel
    
    bert_path = "/Users/Caleb/Desktop/codes/ptms/bert-base"  # local copy of the pretrained model
    tokenizer = BertTokenizerFast.from_pretrained(bert_path)
    model = BertModel.from_pretrained(bert_path)
    model.eval()  # evaluation mode: disables dropout, so outputs are deterministic
    
    max_length = 32
    test_str = "This is a sentence."
    
    # Tokenize, pad to a fixed length, and build (batch, seq_len) tensors
    tokenized = tokenizer(test_str, max_length=max_length, padding="max_length")
    input_ids = torch.unsqueeze(torch.LongTensor(tokenized['input_ids']), 0)
    attention_mask = torch.unsqueeze(torch.IntTensor(tokenized['attention_mask']), 0)
    
    with torch.no_grad():  # inference only; no gradient tracking needed
        res = model(input_ids, attention_mask=attention_mask)
    print(res.last_hidden_state)
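
    To double-check that model.eval() is what fixes the nondeterminism, here is a minimal sketch (using the hub name "bert-base-uncased" from the question instead of the local path above) that runs the model twice on the same input and compares the outputs, then mean-pools the token vectors into a single sentence embedding:

    import torch
    from transformers import BertTokenizerFast, BertModel
    
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")
    model.eval()  # disable dropout so repeated runs give identical outputs
    
    inputs = tokenizer("This is a sentence.", return_tensors="pt")
    
    with torch.no_grad():  # no gradients needed at inference time
        out1 = model(**inputs).last_hidden_state
        out2 = model(**inputs).last_hidden_state
    
    print(torch.allclose(out1, out2))  # True once dropout is disabled
    
    # Mean-pool the token vectors (ignoring padding) to get one
    # fixed-size embedding per sentence.
    mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq_len, 1)
    sentence_embedding = (out1 * mask).sum(1) / mask.sum(1)
    print(sentence_embedding.shape)  # torch.Size([1, 768])

    Mean pooling over the attention mask is just one common way to collapse per-token states into a sentence vector; taking out1[:, 0] (the [CLS] token) is another.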