mT5 fine-tuning does not use the GPU (Volatile GPU-Util 0%)
Hi, I'm trying to fine-tune the mt5-base model for Korean-English translation. I think CUDA is set up correctly (torch.cuda.is_available() returns True), but during training the GPU is not used, apart from a very brief moment at the start when the dataset is loaded.
I'd like to use the GPU efficiently and would appreciate advice on fine-tuning a translation model. Here are my code and training environment.
import logging
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args
import torch
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
train_df = pd.read_csv("data/enko_train.tsv", sep="\t").astype(str)
eval_df = pd.read_csv("data/enko_eval.tsv", sep="\t").astype(str)
train_df["prefix"] = ""
eval_df["prefix"] = ""
model_args = T5Args()
model_args.max_seq_length = 96
model_args.train_batch_size = 64
model_args.eval_batch_size = 32
model_args.num_train_epochs = 10
model_args.evaluate_during_training = True
model_args.evaluate_during_training_steps = 1000
model_args.use_multiprocessing = False
model_args.fp16 = True
model_args.save_steps = 1000
model_args.save_eval_checkpoints = True
model_args.no_cache = True
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True
model_args.preprocess_inputs = False
model_args.num_return_sequences = 1
model_args.wandb_project = "MT5 Korean-English Translation"
print("Is cuda available?", torch.cuda.is_available())
model = T5Model("mt5", "google/mt5-base", cuda_device=0, args=model_args)
# Train the model
model.train_model(train_df, eval_data=eval_df)
# Optional: Evaluate the model. We'll test it properly anyway.
results = model.eval_model(eval_df, verbose=True)
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_May__3_19:15:13_PDT_2021
Cuda compilation tools, release 11.3, V11.3.109
Build cuda_11.3.r11.3/compiler.29920130_0
gpu 0 = Quadro RTX 6000
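To check whether training touches the GPU at all, it can help to poll nvidia-smi from a second terminal while the script runs. Below is a minimal sketch of that idea. The helper names (parse_gpu_stats, read_gpu_stats) and the sample numbers are made up for illustration, and the parser assumes every field is a plain integer, which may not hold on GPUs that report "[N/A]".

```python
import subprocess

# Fields requested from nvidia-smi: GPU utilization (%), memory used and total (MiB).
QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse the output of `nvidia-smi --format=csv,noheader,nounits` (one line per GPU)."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append(
            {"util_pct": util, "mem_used_mib": used, "mem_total_mib": total}
        )
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_stats(result.stdout)


# Illustrative sample only (not measured on the machine described above).
sample = "0, 1523, 24576\n"
print(parse_gpu_stats(sample))
```

If memory used stays near zero while the script is running, the model was probably never moved to the GPU. If memory is nearly full but utilization is 0%, the process may be stuck or about to fail with an out-of-memory error.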
It was just an out-of-memory case: the model parameters and dataset were not being loaded into GPU memory. So I switched the model from mt5-base to mt5-small, removed the save points (checkpointing), and reduced the dataset.
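For anyone wondering why mt5-base can run out of memory with these settings, here is a rough back-of-the-envelope sketch. All numbers are assumptions, not measurements: about 580M parameters for mt5-base and 300M for mt5-small, a vocabulary of about 250k tokens, an Adam-style optimizer with fp32 weights, gradients and two moment buffers (16 bytes per parameter), and targets padded to 96 tokens. The function name is hypothetical, and the estimate ignores all other activations, so real usage will be higher.

```python
GIB = 1024 ** 3


def training_memory_gib(n_params, batch_size, seq_len, vocab_size,
                        bytes_per_param_state=16, bytes_per_logit=4):
    """Rough lower bound on training memory, returned as (states_gib, logits_gib).

    states: fp32 weights + gradients + two Adam moment buffers per parameter.
    logits: one decoder output tensor of shape (batch, seq_len, vocab).
    """
    states = n_params * bytes_per_param_state
    logits = batch_size * seq_len * vocab_size * bytes_per_logit
    return states / GIB, logits / GIB


# Assumed approximate sizes; batch size and sequence length from the script above.
VOCAB = 250_112
for name, n_params in [("mt5-small", 300e6), ("mt5-base", 580e6)]:
    states, logits = training_memory_gib(n_params, batch_size=64,
                                         seq_len=96, vocab_size=VOCAB)
    print(f"{name}: states ~{states:.1f} GiB, logits ~{logits:.1f} GiB")
```

Under these assumptions the large vocabulary makes the output logits alone cost several GiB at batch size 64, independent of model size, so lowering train_batch_size is another option besides switching to mt5-small.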