
PipelineException: No mask_token ([MASK]) found on the input


I am getting the error "PipelineException: No mask_token ([MASK]) found on the input" when I run the line `fill_mask("Auto Car <mask>.")`.

I am running it on Colab. My Code:

from pathlib import Path

from tokenizers import ByteLevelBPETokenizer
from transformers import BertTokenizer, BertForMaskedLM


paths = [str(x) for x in Path(".").glob("**/*.txt")]
print(paths)

bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

from transformers import BertModel, BertConfig

configuration = BertConfig()
model = BertModel(configuration)
configuration = model.config
print(configuration)

model = BertForMaskedLM.from_pretrained("bert-base-uncased")

from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=bert_tokenizer,
    file_path="./kant.txt",
    block_size=128,
)

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=bert_tokenizer, mlm=True, mlm_probability=0.15
)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./KantaiBERT",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
    )

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()

from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model=model,
    tokenizer=bert_tokenizer,
    device=0,
)

fill_mask("Auto Car <mask>.")     # This line is giving me the error...

The last line is giving me the error mentioned above. Please let me know what I am doing wrong, or what I have to do to fix this error.

Complete error: `No mask_token ({self.tokenizer.mask_token}) found on the input`


Solution

  • The root cause: `bert_tokenizer` is a BERT tokenizer, whose mask token is `[MASK]`, whereas `<mask>` is the mask token of RoBERTa-style tokenizers, so the pipeline never finds a mask in your input. Even if you have already fixed the error, here is a recommendation to avoid it in the future. Instead of hard-coding the mask token, as in

    fill_mask("Auto Car <mask>.")
    

    you can read it from the tokenizer, which stays correct when you switch to a different model:

    MASK_TOKEN = bert_tokenizer.mask_token
    
    fill_mask("Auto Car {}.".format(MASK_TOKEN))
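
    As a minimal sketch of that pattern (no `transformers` dependency; the helper name is hypothetical), building the prompt from whatever mask token the tokenizer reports keeps one template working for both BERT-style (`[MASK]`) and RoBERTa-style (`<mask>`) models:

```python
# Hypothetical helper: substitute whichever mask token the tokenizer
# reports into a "{}" placeholder, instead of hard-coding it.
def build_masked_input(template: str, mask_token: str) -> str:
    return template.format(mask_token)

# BERT-family tokenizers report mask_token == "[MASK]";
# RoBERTa-family tokenizers report mask_token == "<mask>".
print(build_masked_input("Auto Car {}.", "[MASK]"))  # Auto Car [MASK].
print(build_masked_input("Auto Car {}.", "<mask>"))  # Auto Car <mask>.
```

    With a real tokenizer you would pass `bert_tokenizer.mask_token` as the second argument.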