Tags: performance, huggingface-transformers, huggingface-tokenizers, huggingface-datasets

HuggingFace's BertTokenizerFast is between 39000 and 258300 times slower than expected


As part of training a BERT model, I am tokenizing a 600MB corpus, which should apparently take approx. 12 seconds. I tried this on a computing cluster and on a Google Colab Pro server, and got time estimates ranging from 130 to 861 hours.

Here's the minimal working example (most of the values aren't hard-coded, but I specified the ones I use most of the time here for simplicity):

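from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

training_args = TrainingArguments(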
    output_dir=args.output_dir,
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=512,  # replaces the deprecated per_gpu_train_batch_size
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=2e-5,
    weight_decay=0.15,
    push_to_hub=False,
    gradient_accumulation_steps=4
)

dataset = load_dataset(
    "text",
    data_files="mycorpus.txt")['train'].shuffle(seed=42)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Code stolen from https://github.com/huggingface/notebooks/blob/main/examples/language_modeling.ipynb
# except I replaced the tokenization function with a lambda
tokenized_dataset = dataset.map(
    lambda examples: tokenizer(examples["text"]),
    batched=True,
    num_proc=4,
    remove_columns=["text"])

lm_dataset = tokenized_dataset.map(
    group_texts,
    batched=True,
    batch_size=512,
    num_proc=4
)
# /steal

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

model_trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=lm_dataset
)
model_trainer.train()
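
The snippet above calls group_texts without defining it. A minimal sketch of what it could look like, modeled on the linked language_modeling notebook: it concatenates the tokenized examples in a batch and cuts them into fixed-length chunks. The block size of 128 and the BLOCK_SIZE name are assumptions, not values taken from the question.

```python
from itertools import chain

BLOCK_SIZE = 128  # assumed chunk length; not stated in the question


def group_texts(examples, block_size=BLOCK_SIZE):
    """Concatenate a batch of tokenized examples and split it into equal chunks."""
    # Flatten each column (input_ids, attention_mask, ...) into one long list.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder so that every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [seq[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, seq in concatenated.items()
    }
    # For masked LM the collator applies the masking, so labels start as a copy.
    result["labels"] = [chunk[:] for chunk in result["input_ids"]]
    return result
```

Because block_size has a default, the function can still be passed to dataset.map(group_texts, batched=True) unchanged.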

Tracing the execution path in PDB shows that the issue arises in the call to model_trainer.train(), which I guess ends up calling the lambda used to build tokenized_dataset.
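
One way to check that guess is to time each stage separately. On a regular (non-streaming) dataset, Dataset.map runs eagerly, so tokenization should finish at the map call rather than inside train(). The sketch below is a generic timer; the stage helper and the timings dict are hypothetical names, and the workload is a stand-in rather than the real tokenizer.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def stage(name):
    """Record and print how long the enclosed block takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start
        print(f"{name}: {timings[name]:.2f}s")


# Stand-in workload; in the real script this would wrap dataset.map(...).
with stage("tokenize"):
    tokens = [line.split() for line in ["a b c"] * 1000]
```

Wrapping the two dataset.map calls and model_trainer.train() in separate stage blocks would show which step the long time estimate actually belongs to.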

I do get the following message:

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.

However, I do believe my lambda calls the __call__ function implicitly, does it not? Can I do something about this?

That said, I'm doubtful that this warning message is relevant, as it seems to imply a relatively minor slowdown. I feel like it would have a more dramatic tone if it were related to the staggering difference that I'm observing.


Solution

  • It turns out that the log message about BertTokenizerFast had nothing to do with the progress bar that appeared right after it. I took that bar for the tokenization progress bar, but it was in fact the training progress bar. The actual problem was that the model was training on CPU instead of GPU. I thought I had ruled this out because I had verified that torch.cuda.is_available() == True, and HuggingFace Trainers are supposed to use CUDA if available. However, the installed PyTorch build did not match my CUDA version, so even though CUDA was reported as "available", PyTorch refused to use the GPU and HuggingFace fell back to CPU training. All of this happened silently, with no warnings or error messages.
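
Since the availability check alone was misleading here, a stricter test is to actually run a small operation on the GPU before training. This is a minimal sketch, not the answerer's method; cuda_smoke_test is a hypothetical helper name, and the import is kept inside the function so the check also reports a missing PyTorch install.

```python
def cuda_smoke_test():
    """Return (ok, detail): whether a tensor operation really runs on the GPU."""
    try:
        import torch
    except ImportError:
        return False, "PyTorch is not installed"

    if not torch.cuda.is_available():
        return False, "torch.cuda.is_available() is False"

    try:
        # Allocate on the GPU and force a computation there.
        x = torch.ones(8, device="cuda")
        total = (x * 2).sum().item()
    except (RuntimeError, AssertionError) as err:
        return False, f"CUDA reported as available, but a GPU op failed: {err}"

    detail = (
        f"ran on {torch.cuda.get_device_name(0)}, "
        f"PyTorch built for CUDA {torch.version.cuda}"
    )
    return total == 16.0, detail


ok, detail = cuda_smoke_test()
print("GPU usable:", ok, "|", detail)
```

It can also help to print training_args.device and next(model.parameters()).device before calling model_trainer.train(), to see which device the Trainer has picked.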