I have trained a model using Hugging Face's integration with Amazon SageMaker and their Hello World example.
I can easily calculate and view the metrics generated on the evaluation test set (accuracy, F1 score, precision, recall, etc.) by calling training_job_analytics
on the trained model: huggingface_estimator.training_job_analytics.dataframe()
How can I also see the same metrics on the training set (or even the training error for each epoch)?
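For reference, the same numbers can also be pulled one metric at a time with the TrainingJobAnalytics class; this is a sketch of what I use, where 'eval_loss' is just one of the names from the metric_definitions below:
from sagemaker import TrainingJobAnalytics

# Query a single named metric from the completed training job; the job name
# is taken from the fitted estimator.
eval_loss_df = TrainingJobAnalytics(
    training_job_name=huggingface_estimator.latest_training_job.name,
    metric_names=['eval_loss'],
).dataframe()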
The training code is essentially the same as in the linked example, with the metric definitions from the docs added:
from sagemaker.huggingface import HuggingFace
# optionally parse logs for key metrics
# from the docs: https://huggingface.co/docs/sagemaker/train#sagemaker-metrics
metric_definitions = [
    {'Name': 'loss', 'Regex': r"'loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'learning_rate', 'Regex': r"'learning_rate': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_loss', 'Regex': r"'eval_loss': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_accuracy', 'Regex': r"'eval_accuracy': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_f1', 'Regex': r"'eval_f1': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_precision', 'Regex': r"'eval_precision': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_recall', 'Regex': r"'eval_recall': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_runtime', 'Regex': r"'eval_runtime': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'eval_samples_per_second', 'Regex': r"'eval_samples_per_second': ([0-9]+(.|e\-)[0-9]+),?"},
    {'Name': 'epoch', 'Regex': r"'epoch': ([0-9]+(.|e\-)[0-9]+),?"}
]
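# For reference, these regexes are matched against the Trainer's stdout log
# lines, which look roughly like this (values illustrative only):
# {'loss': 0.6345, 'learning_rate': 4.9e-05, 'epoch': 0.1}
# {'eval_loss': 0.4512, 'eval_accuracy': 0.81, 'eval_f1': 0.79, ...}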
# hyperparameters, which are passed into the training job
hyperparameters = {
    'epochs': 5,
    'train_batch_size': batch_size,
    'model_name': model_checkpoint,
    'task': task,
}
# init the model (but not yet trained)
huggingface_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,
    transformers_version='4.6',
    pytorch_version='1.7',
    py_version='py36',
    hyperparameters=hyperparameters,
    metric_definitions=metric_definitions,
)
# starting the train job with our uploaded datasets as input
huggingface_estimator.fit({'train': training_input_path, 'test': test_input_path})
# does not return metrics on training - only on eval!
huggingface_estimator.training_job_analytics.dataframe()
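To confirm which metrics were actually captured, the metric_name column of the returned frame (its columns are timestamp, metric_name, and value) can be inspected:
df = huggingface_estimator.training_job_analytics.dataframe()
print(df['metric_name'].unique())  # 'loss' is missing - only the eval metrics came through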
This can be solved by increasing the number of training epochs to a more realistic value.
Currently, the whole job finishes in fewer than 300 seconds, which appears to be before the first timestamp at which the training metrics (and presumably the loss) would be recorded, so nothing is ever captured for them.
Changes to make:
hyperparameters = {
    'epochs': 100,  # increase the number of epochs to a realistic value!
    'train_batch_size': batch_size,
    'model_name': model_checkpoint,
    'task': task,
}
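If you also control scripts/train.py, a complementary option is to make the Trainer log the training loss once per epoch rather than every 500 steps (the default), so even short jobs produce 'loss' lines for the regexes above to match. A minimal sketch, assuming train.py builds its own TrainingArguments and that args is its argparse namespace (output_dir and the args.* names here are assumptions about how the script is written):
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='/opt/ml/model',                       # SageMaker's model directory
    num_train_epochs=args.epochs,                     # passed in via hyperparameters
    per_device_train_batch_size=args.train_batch_size,
    evaluation_strategy='epoch',                      # eval metrics once per epoch
    logging_strategy='epoch',                         # training loss once per epoch
)
With logging_strategy='epoch', the Trainer prints a {'loss': ..., 'epoch': ...} line after every epoch, which then shows up in the CloudWatch logs that metric_definitions is parsed against.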