Tags: python, pytorch, huggingface-transformers

RuntimeError: "mse_cuda" not implemented for 'Long' when training a transformers.Trainer


I'm attempting to train a model with the Hugging Face Trainer but am seeing the following error:

RuntimeError: "mse_cuda" not implemented for 'Long'

I've tried this in multiple cloud environments (CPU & GPU) with no luck. The dataset (tok_ds) has the following shape and column types, and I've ensured there are no null values.

Dataset({
    features: ['label', 'title', 'text', 'input', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 5000
})

{'label': int,
 'title': str,
 'text': str,
 'input': str,
 'input_ids': list,
 'token_type_ids': list,
 'attention_mask': list}
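For reference, the exact storage dtype of each column can be checked via the Dataset's features attribute (a quick check, assuming tok_ds is a datasets.Dataset):

print(tok_ds.features)           # maps each column name to its feature type
print(tok_ds.features['label'])  # e.g. Value(dtype='int64', id=None)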

I have defined my metric functions as below (these are for compute_metrics, not the loss):

import numpy as np

# Pearson correlation between predictions and labels
def corr(x, y): return np.corrcoef(x, y)[0][1]
def corr_d(eval_pred): return {'pearson': corr(*eval_pred)}
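As a quick sanity check that the metric itself is fine, it behaves as expected on dummy arrays (illustrative values, not from my data):

preds  = np.array([0.1, 0.4, 0.35, 0.8])
labels = np.array([0.0, 0.0, 1.0, 1.0])
print(corr_d((preds, labels)))  # {'pearson': 0.647...}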

However, when attempting to train model_nm = 'microsoft/deberta-v3-small' on the train/test split of my dataset, I see the following error:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer

dds = tok_ds.train_test_split(0.25, seed=42)
tokz = AutoTokenizer.from_pretrained(model_nm)
model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=1)
trainer = Trainer(model, args, train_dataset=dds['train'], eval_dataset=dds['test'],
                  tokenizer=tokz, compute_metrics=corr_d)
...
...
File /shared-libs/python3.9/py/lib/python3.9/site-packages/torch/nn/functional.py:3280, in mse_loss(input, target, size_average, reduce, reduction)
   3277     reduction = _Reduction.legacy_get_string(size_average, reduce)
   3279 expanded_input, expanded_target = torch.broadcast_tensors(input, target)
-> 3280 return torch._C._nn.mse_loss(expanded_input, expanded_target, _Reduction.get_enum(reduction))
RuntimeError: "mse_cuda" not implemented for 'Long'

Here are the args passed into the Trainer, in case it's relevant:

from transformers import TrainingArguments

args = TrainingArguments('outputs', learning_rate=lr, warmup_ratio=0.1, lr_scheduler_type='cosine', fp16=True,
    evaluation_strategy="epoch", per_device_train_batch_size=bs, per_device_eval_batch_size=bs*2,
    num_train_epochs=epochs, weight_decay=0.01, report_to='none')

Here's what I think may be the relevant environment information:

!python --version
Python 3.9.13

!pip list
Package                       Version
----------------------------- ------------
...
transformers                  4.21.1
huggingface-hub               0.8.1
pandas                        1.2.5
protobuf                      3.19.4
scikit-learn                  1.1.1
tensorflow                    2.9.1
torch                         1.12.0

Can anyone point me in the right direction to solve this problem?


Solution

  • Changing the datatype of the label column from int to float solved this issue for me. With num_labels=1, AutoModelForSequenceClassification treats the task as regression, so the Trainer computes MSE loss, and PyTorch's MSE kernel is not implemented for integer (Long) tensors, which is exactly what the traceback says. If your Dataset is from a pandas DataFrame, you can change the datatype of the column before passing the DataFrame to a Dataset; see the sketch below.
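A minimal sketch of the pandas route (df and its contents are illustrative, not from the question):

import pandas as pd
from datasets import Dataset

df = pd.DataFrame({'label': [0, 1, 1], 'input': ['a', 'b', 'c']})  # stand-in data
df['label'] = df['label'].astype(float)   # int64 -> float64 so MSE loss can run
tok_ds = Dataset.from_pandas(df)
print(tok_ds.features['label'])           # Value(dtype='float64', id=None)

If the data is already a datasets.Dataset, the same cast can be done in place with cast_column:

from datasets import Value

tok_ds = tok_ds.cast_column('label', Value('float32'))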