python, machine-learning, huggingface-transformers

GPU utilization almost always 0 while training a Hugging Face Transformer


I am fine-tuning a Donut Cord-v2 model with my invoice data, which is around 360 GB in size when preprocessed and saved on disk as a dataset. I am following this notebook almost exactly, except that I use 6 training epochs instead of 3.

I am training on a single Nvidia H100 SXM GPU / Intel Xeon® Gold 6448Y / 128 GB RAM.

Whenever I start training and inspect CPU and GPU utilization using htop and nvidia-smi, I see that the CPU is at 10-12% utilization (used by python), GPU memory is almost 90% full the whole time, but GPU utilization is almost always 0. If I keep refreshing the output of nvidia-smi, once every 10-12 seconds the utilization jumps to 100% and then immediately drops back to 0. I can't help but feel there is a bottleneck between my CPU and GPU: the CPU constantly processes data and sends it to the GPU, the GPU processes it very quickly, and then just idles, waiting for the next batch from the CPU. I load the already preprocessed dataset from disk like so:

from datasets import load_from_disk
processed_dataset = load_from_disk(r"/dataset/dataset_final")

My processor config is as follows:

from transformers import DonutProcessor

new_special_tokens = [] # new tokens which will be added to the tokenizer
task_start_token = "<s>"  # start of task token
eos_token = "</s>" # eos token of tokenizer

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# add new special tokens to tokenizer
processor.tokenizer.add_special_tokens({"additional_special_tokens": new_special_tokens + [task_start_token] + [eos_token]})

# we update some settings which differ from pretraining; namely the size of the images + no rotation required
processor.feature_extractor.size = [1200,1553] # should be (width, height)
processor.feature_extractor.do_align_long_axis = False

My model config is:

import torch
from transformers import VisionEncoderDecoderModel, VisionEncoderDecoderConfig

#print(torch.cuda.is_available())

# Load model from huggingface.co
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# Resize embedding layer to match vocabulary size
new_emb = model.decoder.resize_token_embeddings(len(processor.tokenizer))
print(f"New embedding size: {new_emb}")
# Adjust our image size and output sequence lengths
model.config.encoder.image_size = processor.feature_extractor.size[::-1] # (height, width)
model.config.decoder.max_length = len(max(processed_dataset["train"]["labels"], key=len))

# Add task token for decoder to start
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.decoder_start_token_id = processor.tokenizer.convert_tokens_to_ids(['<s>'])[0]

And my training code is:

import gc
gc.collect()

torch.cuda.empty_cache()


from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

import logging
logging.basicConfig(level=logging.INFO)

# Arguments for training
training_args = Seq2SeqTrainingArguments(
    output_dir=r"/trained",  # Specify a local directory to save the model
    num_train_epochs=6,
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    weight_decay=0.01,
    fp16=True,
    logging_steps=50,
    save_total_limit=2,
    evaluation_strategy="no",
    save_strategy="epoch",
    predict_with_generate=True,
    report_to="none",
    # Disable push to hub
    push_to_hub=False,
)

# Create Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=processed_dataset["train"],
)


# Start training
trainer.train()

The estimated time to complete training with 6 epochs on the 360 GB dataset is 54 hours. When I run the exact same code on my PC, which has an Intel i9 11900KF / RTX 3050, I see GPU utilization constantly at 100%. Is there a bottleneck in my code? Why does the CPU keep doing so much processing on an already preprocessed dataset? CUDA 12.6.

Edit:

Does it make sense to set the dataloader_num_workers parameter of Seq2SeqTrainingArguments to a value > 0, since my RAM and CPU core count allow it (and since CPU utilization is at 10-12% at most)? Something like the snippet below is what I have in mind.
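
For reference, this is roughly what I mean (dataloader_num_workers and dataloader_pin_memory are existing TrainingArguments options; the worker count of 8 is only a guess based on my core count):

training_args = Seq2SeqTrainingArguments(
    output_dir=r"/trained",
    # ... all other arguments same as above ...
    dataloader_num_workers=8,    # e.g. 8 worker processes prepare batches instead of the main process
    dataloader_pin_memory=True,  # pinned (page-locked) host memory speeds up CPU-to-GPU copies
)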


Solution

  • You seem to have an I/O bottleneck. It means the data cannot be transferred fast enough, and your GPU ends up waiting for data most of the time. You can verify that claim by checking the status of the Python workers in htop (processes stuck in the D / uninterruptible-sleep state are waiting on I/O).

    You do not seem to have a CPU bottleneck because your CPU isn't fully used.

    This often happens on VMs when the data is transferred over old protocols like NFS. If your VM has a local disk, you can try copying the data there before training and pointing your Hugging Face dataset to that local path (see the sketch at the end of this answer). This could also be due to a suboptimal configuration of the data loading process. You might want to give this a read.

    You might not be seeing this issue on your PC because:

    1. Your GPU is slower than an H100, so it takes more time to process a single batch. As a result, your system has more time to load the next batch.
    2. Your data is stored on your local disk, so loading it takes much less time.

    And yes, please increase your number of workers; it can drastically improve performance.
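
    A rough sketch of the local-copy idea, plus a quick way to check how fast samples come out of the data pipeline on their own (the /local_scratch path is only a placeholder for wherever your VM's local disk is mounted):

    import shutil
    import time

    from datasets import load_from_disk

    # Copy the preprocessed dataset from the network share to the local disk once,
    # so training reads from fast local storage instead of going over NFS.
    local_path = "/local_scratch/dataset_final"   # placeholder local-disk path
    shutil.copytree("/dataset/dataset_final", local_path)

    # Point the Hugging Face dataset at the local copy.
    processed_dataset = load_from_disk(local_path)

    # Time how long it takes to read a few hundred samples with no GPU involved.
    # If this alone is slow, the bottleneck is the storage / data pipeline, not the model.
    start = time.time()
    for i, sample in enumerate(processed_dataset["train"]):
        if i == 200:
            break
    print(f"Read 200 samples in {time.time() - start:.1f}s")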