machine-learning, huggingface-transformers

Huggingface Trainer only doing 3 epochs no matter the TrainingArguments


I'm new to machine learning and I'm facing an issue: I want to increase the number of training epochs, but .train() only ever runs 3 epochs. What am I doing wrong?

This is my dataset:

> DatasetDict({
>     train: Dataset({
>         features: ['text', 'label'],
>         num_rows: 85021
>     })
>     test: Dataset({
>         features: ['text', 'label'],
>         num_rows: 15004
>     })
> })

and its features:

> {'label': ClassLabel(num_classes=20, names=['01. AGRI', '02. ALIM',
> '03. CHEMFER', '04. ATEX', '05. MACH', '06. MARNAV', '07. CONST', '08.
> MINES', '09. DOM', '10. TRAN', '11. ARARTILL', '12. PREELEC', '13.
> CER', '14. ACHIMI', '15. ECLA', '16. HABI', '17. ANDUS', '18. ARBU',
> '19. CHIRUR', '20. ARPA'], id=None), 'text': Value(dtype='string',
> id=None)}

My Trainer:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

This is what .train() shows:

    ***** Running training *****
      Num examples = 85021
      Num Epochs = 3
      Instantaneous batch size per device = 8
      Total train batch size (w. parallel, distributed & accumulation) = 8
      Gradient Accumulation steps = 1
      Total optimization steps = 31884

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.994300      | 0.972638        | 0.711610 |
| 2     | 0.825400      | 0.879027        | 0.736337 |
| 3     | 0.660800      | 0.893457        | 0.744401 |

I would like to train for more than 3 epochs to increase accuracy and further reduce the training and validation loss. I tried setting num_train_epochs=10, as you can see below, but nothing changes.
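One way to see which arguments the Trainer actually picked up is to inspect trainer.args after constructing it; a minimal sketch, using the trainer instance and training_args from my code:

    print(training_args.num_train_epochs)   # what I set in TrainingArguments: 10
    print(trainer.args.num_train_epochs)    # what the Trainer instance actually uses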

This is the relevant part of my code:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=10,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

### Metrics
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

### Trainer
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

Solution

  • I found the issue. I had defined training_args twice in my code. The second definition came right before the Trainer, so the Trainer read its arguments from that one, which did not set num_train_epochs. The code should be:

        training_args = TrainingArguments(
            output_dir='./results',          # output directory
            num_train_epochs=10,              # total number of training epochs
            per_device_train_batch_size=8,  # batch size per device during training
            per_device_eval_batch_size=16,   # batch size for evaluation
            warmup_steps=500,                # number of warmup steps for learning rate scheduler
            weight_decay=0.01,               # strength of weight decay
            logging_dir='./logs',            # directory for storing logs
            logging_steps=10,
        )
        
        ### Metrics
        import numpy as np
        from datasets import load_metric

        metric = load_metric("accuracy")
        def compute_metrics(eval_pred):
            logits, labels = eval_pred
            predictions = np.argmax(logits, axis=-1)
            return metric.compute(predictions=predictions, references=labels)
    

    After this part you can create the Trainer and call trainer.train(), for example as sketched below.
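    A minimal sketch of that final step, assuming the model, tokenizer, tokenized_datasets, data_collator, and compute_metrics defined earlier, and only this single TrainingArguments definition:

        trainer = Trainer(
            model=model,                                  # sequence classification model defined earlier
            args=training_args,                           # the single TrainingArguments with num_train_epochs=10
            train_dataset=tokenized_datasets["train"],
            eval_dataset=tokenized_datasets["test"],
            data_collator=data_collator,
            tokenizer=tokenizer,
            compute_metrics=compute_metrics,
        )

        trainer.train()                                   # the log should now report Num Epochs = 10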