Tags: python, pytorch, huggingface-transformers, huggingface

Hugging Face model training loop has the same performance on CPU and GPU? Confused as to why?


Question

I created two Python notebooks to fine-tune BERT on a Yelp review dataset for sentiment analysis. The only difference between the two notebooks is that one moves the model to the CPU with .to("cpu") while the other moves it to the GPU with .to("cuda").

Despite this difference in hardware, the training times for both notebooks are nearly the same. I am new to using Hugging Face, so I'm wondering if there's anything I might be overlooking. Both notebooks are running on a machine with a single GPU.
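
Roughly, the setup in both notebooks looks like the following sketch (the dataset name yelp_review_full, the checkpoint bert-base-cased, and the subset sizes are illustrative placeholders rather than the exact values I used):

    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    dataset = load_dataset("yelp_review_full")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True)

    tokenized = dataset.map(tokenize, batched=True)
    train_ds = tokenized["train"].shuffle(seed=42).select(range(1000))
    eval_ds = tokenized["test"].shuffle(seed=42).select(range(100))

    # Yelp reviews use 5 star ratings, hence 5 labels
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=5
    )
    model.to("cuda")  # the other notebook uses .to("cpu") here

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="./out", max_steps=100),
        train_dataset=train_ds,
        eval_dataset=eval_ds,
    )
    trainer.train()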

Metrics for CPU

TrainOutput(global_step=100, training_loss=1.5707407319545745, metrics={'train_runtime': 116.5447, 'train_samples_per_second': 3.432, 'train_steps_per_second': 0.858, 'total_flos': 105247256985600.0, 'train_loss': 1.5707407319545745, 'epoch': 0.4})

{'eval_loss': 1.4039757251739502,
 'eval_accuracy': 0.4,
 'eval_runtime': 3.6833,
 'eval_samples_per_second': 27.15,
 'eval_steps_per_second': 3.529,
 'epoch': 0.4}

# specifically concerned with 'train_samples_per_second': 3.432

Metrics for GPU

TrainOutput(global_step=100, training_loss=1.6277318179607392, metrics={'train_runtime': 115.46, 'train_samples_per_second': 3.464, 'train_steps_per_second': 0.866, 'total_flos': 105247256985600.0, 'train_loss': 1.6277318179607392, 'epoch': 0.4})

{'eval_loss': 1.525576114654541,
 'eval_accuracy': 0.35,
 'eval_runtime': 3.6518,
 'eval_samples_per_second': 27.384,
 'eval_steps_per_second': 3.56,
 'epoch': 0.4}

# specifically concerned with 'train_samples_per_second': 3.464

Solution

  • I assume that the machine you were using had access to a GPU. The Hugging Face Trainer automatically uses the GPU if one is available. It does not matter that you moved the model to CPU or CUDA yourself: the Trainer does not check this and will move your model to CUDA whenever it is available. You can turn off this device placement with the TrainingArguments setting no_cuda:

    from transformers import TrainingArguments

    training_args = TrainingArguments(
        output_dir="./some_local_dir",
        overwrite_output_dir=True,
        per_device_train_batch_size=4,
        dataloader_num_workers=2,
        max_steps=100,
        logging_steps=1,
        evaluation_strategy="steps",
        eval_steps=5,
        no_cuda=True,  # force training on the CPU even if a GPU is available
    )
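
  • To double-check which device the model actually ends up on, you can inspect its parameters after constructing the Trainer. This is a small sanity check; the names training_args and small_train_dataset below are placeholders for whatever you already have in your notebook:

    import torch
    from transformers import Trainer

    # The Trainer moves the model to its chosen device when it is constructed,
    # regardless of any earlier model.to("cpu") / model.to("cuda") calls.
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=small_train_dataset,
    )
    print(torch.cuda.is_available())                # is a GPU visible at all?
    print(next(trainer.model.parameters()).device)  # e.g. "cpu" when no_cuda=True

    If this prints cuda:0 in both notebooks, both runs were in fact training on the GPU, which would explain the nearly identical throughput. Also note that newer releases of transformers deprecate no_cuda in favor of a use_cpu argument, so check which one your installed version expects.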