I want to train the "flax-community/t5-large-wikisplit" model with the "dxiao/requirements-ner-id" dataset (just for some experiments).
I think my general procedure is not correct, but I don't know how to go further.
My Code:
Load tokenizer and model:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
checkpoint = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).cuda()
Load dataset that I want to train:
from datasets import load_dataset
raw_dataset = load_dataset("dxiao/requirements-ner-id")
The raw_dataset has the columns ['id', 'tokens', 'tags', 'ner_tags'].
I want to get the sentences as whole sentences and not as token lists.
def tokenToString(tokenarray):
    string = tokenarray[0]
    for x in tokenarray[1:]:
        string += " " + x
    return string

def sentence_function(example):
    return {"sentence": tokenToString(example["tokens"]),
            "simplefiedSentence": tokenToString(example["tokens"]).replace("The", "XXXXXXXXXXX")}
wikisplit_req_set = raw_dataset.map(sentence_function)
wikisplit_req_set
I tried to restructure the dataset such that it looks like the wikisplit dataset:
simple1dataset = wikisplit_req_set.remove_columns(['id', 'tags', 'ner_tags', 'tokens']);
complexdataset = wikisplit_req_set.remove_columns(['id', 'tags', 'ner_tags', 'tokens']);
complexdataset["train"] = complexdataset["train"].add_column("simple_sentence_1",simple1dataset["train"]["sentence"]).add_column("simple_sentence_2",simple1dataset["train"]["simplefiedSentence"])
complexdataset["test"] = complexdataset["test"].add_column("simple_sentence_1",simple1dataset["test"]["sentence"]).add_column("simple_sentence_2",simple1dataset["test"]["simplefiedSentence"])
complexdataset["validation"] = complexdataset["validation"].add_column("simple_sentence_1",simple1dataset["validation"]["sentence"]).add_column("simple_sentence_2",simple1dataset["validation"]["simplefiedSentence"])
trainingDataSet = complexdataset.rename_column("sentence", "complex_sentence")
trainingDataSet
Tokenize it:
def tokenize_function(example):
    model_inputs = tokenizer(example["complex_sentence"], truncation=True, padding=True)
    targetS1 = tokenizer(example["simple_sentence_1"], truncation=True, padding=True)
    targetS2 = tokenizer(example["simple_sentence_2"], truncation=True, padding=True)
    model_inputs['simple_sentence_1'] = targetS1['input_ids']
    model_inputs['simple_sentence_2'] = targetS2['input_ids']
    model_inputs['decoder_input_ids'] = targetS2['input_ids']
    return model_inputs
tokenized_datasets = trainingDataSet.map(tokenize_function, batched=True)
tokenized_datasets=tokenized_datasets.remove_columns("complex_sentence")
tokenized_datasets=tokenized_datasets.remove_columns("simple_sentence_1")
tokenized_datasets=tokenized_datasets.remove_columns("simple_sentence_2")
tokenized_datasets=tokenized_datasets.remove_columns("simplefiedSentence")
tokenized_datasets
DataLoader:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
data_collator
Training:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, TrainingArguments, EvalPrediction, DataCollatorWithPadding, Trainer
import evaluate

bleu = evaluate.load("bleu")
training_args = Seq2SeqTrainingArguments(
    output_dir = "/",
    log_level = "error",
    num_train_epochs = 0.25,
    learning_rate = 5e-4,
    lr_scheduler_type = "linear",
    warmup_steps = 50,
    optim = "adafactor",
    weight_decay = 0.01,
    per_device_train_batch_size = 1,
    per_device_eval_batch_size = 1,
    gradient_accumulation_steps = 16,
    evaluation_strategy = "steps",
    eval_steps = 50,
    predict_with_generate=True,
    generation_max_length = 128,
    save_steps = 500,
    logging_steps = 10,
    push_to_hub = False,
    auto_find_batch_size=True
)
trainer = Seq2SeqTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=bleu,
)
trainer.train()
The problem is that I do not understand how the model knows the expected value and how it calculates its loss. Can someone give me some ideas about what happens where?
I hope someone can help me understand my own code, because the Hugging Face documentation does not help me enough. Maybe someone has some code examples or something else. I do not completely understand how I fine-tune the model and how I get the parameters the model expects for training. I also do not understand how the training works and what the parameters do.
Take some time to go through https://huggingface.co/course/ or read https://www.oreilly.com/library/view/natural-language-processing/9781098136789/
After that, you will have answered most of the questions you're having.
Show me the code: scroll down to the bottom of the answer =)
What does the model expect from the datasets.Dataset and datasets.DatasetDict?
TL;DR: basically, we want to be able to loop through the dataset and get back a dictionary whose keys are the names of the tensors that the model will consume, and whose values are the actual tensors that the model uses in its .forward() function.
In code, you want the processed dataset to be able to do this:
from datasets import load_dataset

ds = load_dataset(...)
ds = ds.map(func_to_preprocess)  # .map returns a new dataset; it does not modify ds in place

for data in ds:
    model(**data)  # Does a forward propagation pass.
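(As an aside to the loss question in the post, here is a minimal sketch of my own, not part of the original code: when you pass labels to a seq2seq model, it builds the decoder inputs by shifting the labels to the right and returns a cross-entropy loss over the target tokens; the Trainer calls exactly this under the hood.)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# A toy input/target pair; the target sentence is just an illustration.
enc = tokenizer("The operating humidity shall be between 0.4 and 0.6", return_tensors="pt")
labels = tokenizer("The humidity shall be between 0.4 and 0.6", return_tensors="pt").input_ids

out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(out.loss)          # cross-entropy between predicted and target tokens
print(out.logits.shape)  # (batch_size, target_length, vocab_size)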
Why can't I just feed the Dataset into the model directly? It's because the individual datasets' creators/maintainers are not necessarily the ones that create the models.
Keeping them independent makes sense, since a dataset can be used by different models, and each model requires the dataset to be preprocessed/"munged"/"manipulated" into the format that it expects (kind of like the Extract, Transform, Load (ETL) process for transformers-based models).
Unless explicitly preprocessed, most datasets are in raw text (str) and annotation/label format, which usually fall into one of these types:
- Text classification (e.g. language identification): [in]: Hallo Welt and [out]: de, with AutoModelForSequenceClassification
- Regression (e.g. sentence-pair scoring): [in]: Hello world <sep> Foo bar and [out]: 32.12, with AutoModelForSequenceClassification
- Sequence-to-sequence (e.g. translation): [in]: Hallo Welt and [out]: Hello World, with AutoModelForSeq2SeqLM
- Token classification (e.g. NER): [in]: Obama is the president and [out]: ['B-PER', 'O', 'O', 'O'], with AutoModelForTokenClassification
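To make that mapping concrete, here's a small sketch of my own (the "bert-base-cased" checkpoint is just a placeholder, not something from the question): each Auto* class puts a different head on top of a pretrained backbone, which is why each task expects different tensors.

from transformers import (AutoModelForSequenceClassification,
                          AutoModelForTokenClassification,
                          AutoModelForSeq2SeqLM)

# Sentence-level head: expects integer `labels` of shape (batch_size,).
classifier = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
# Token-level head: expects `labels` of shape (batch_size, seq_len), one tag id per token.
tagger = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)
# Encoder-decoder: expects `labels` that are the token ids of the target text.
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("flax-community/t5-large-wikisplit")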
For the dataset you're interested in:
from datasets import load_dataset
raw_dataset = load_dataset("dxiao/requirements-ner-id")
raw_dataset['train'][0]
[out]:
{'id': 0,
 'tokens': ['The', 'operating', 'humidity', 'shall', 'be', 'between', '0.4', 'and', '0.6'],
 'tags': ['O', 'B-ATTR', 'I-ATTR', 'O', 'B-ACT', 'B-RELOP', 'B-QUANT', 'O', 'B-QUANT'],
 'ner_tags': [0, 3, 4, 0, 1, 5, 7, 0, 7]}
But the model doesn't understand these inputs and outputs; it only understands torch.tensor objects, hence you need to do some processing.
Normally, a model's tokenizer converts raw strings into a list of token ids:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer(["hello world", "foo bar is a sentence", "fizz buzz"])
[out]:
{'input_ids': [[21820, 296, 1], [5575, 32, 1207, 19, 3, 9, 7142, 1], [361, 5271, 15886, 1]], 'attention_mask': [[1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]]}
But if you feed the dataset's pre-tokenized words into convert_tokens_to_ids directly:

sentences = [
['The', 'operating','humidity','shall','be','between','0.4','and','0.6'],
['The', 'CIS', 'CNET', 'shall', 'accommodate', 'a', 'bandwidth', 'of', 'at', 'least', '24.0575', 'Gbps', 'to', 'the', 'Computer', 'Room', '.']
]
[tokenizer.convert_tokens_to_ids(sent) for sent in sentences]
[out]:
[[634, 2, 2, 2, 346, 24829, 22776, 232, 22787],
[634, 21134, 2, 2, 2, 9, 2, 858, 144, 2, 2, 2, 235, 532, 2, 2, 5]]
Why are there so many 2s? Because they are unknowns. If we take a look at the vocab:
>>> tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
2
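(For comparison, an aside of my own: if you join the words back into a string and let the tokenizer do the splitting, it falls back to smaller subword pieces it knows instead of mapping whole unseen words to the unknown id; that's what the training code further down does with " ".join(x['tokens']).)

text = " ".join(['The', 'operating', 'humidity', 'shall', 'be', 'between', '0.4', 'and', '0.6'])
ids = tokenizer(text)["input_ids"]
# Converting back shows the known subword pieces the string was split into.
print(tokenizer.convert_ids_to_tokens(ids))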
Since the tokenizer doesn't know the NER tags either, one option is to add them to the tokenizer as additional special tokens, so that they get ids of their own, and then map the raw columns into the input_ids and labels the model expects. Here's an example:
from itertools import chain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from datasets import load_dataset
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Get the NER tags.
tag_set = list(map(str, set(chain(*raw_dataset['train']['tags']))))
# Put them into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': tag_set})
train_dataset = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
valid_dataset = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
TL;DR:
from itertools import chain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from datasets import load_dataset
import evaluate
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Get the NER tags.
tag_set = list(map(str, set(chain(*raw_dataset['train']['tags']))))
# Put them into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': tag_set})
# Not sure it's strictly needed, but resize the embeddings so the newly added tag tokens have entries.
model.resize_token_embeddings(len(tokenizer))
train_data = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
valid_data = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
# set special tokens, not sure if it's needed but adding them for sanity...
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
mt_metrics = evaluate.combine(
    ["bleu", "chrf"], force_prefix=True
)

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # -100 marks positions ignored by the loss; swap them for the pad id before decoding.
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    outputs = mt_metrics.compute(predictions=predictions,
                                 references=references)
    return outputs
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=1,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to="none",   # the string "none" actually disables the logging integrations
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=valid_data.with_format("torch"),
    # The examples above are not padded, so pad input_ids/labels per batch.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics
)
trainer.train()
Hey, something seems fishy; when we train an NER model, shouldn't we be using AutoModelForTokenClassification, not AutoModelForSeq2SeqLM?
Yeah, but like many things in life, there are many means to the same end. So in this case, you can take the liberty to be creative, e.g. treat the tag sequence as the target of a sequence-to-sequence model, as the code above does.
I guess you don't really want to do NER anyway, but the lessons learnt from munging the corpus with additional tokens and the .map function should help with what you need.
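If you did want to go the "proper" NER route, here's a rough sketch (mine, not from the original answer; "bert-base-cased" is only a placeholder encoder, and the label alignment follows the usual "first sub-token gets the tag, the rest get -100" convention):

from itertools import chain
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

raw_dataset = load_dataset("dxiao/requirements-ner-id")
num_labels = len(set(chain(*raw_dataset['train']['ner_tags'])))

ner_checkpoint = "bert-base-cased"  # placeholder encoder, not the T5 checkpoint above
ner_tokenizer = AutoTokenizer.from_pretrained(ner_checkpoint)
ner_model = AutoModelForTokenClassification.from_pretrained(ner_checkpoint, num_labels=num_labels)

def tokenize_and_align(example):
    enc = ner_tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for wid in enc.word_ids():
        # Special tokens and continuation sub-tokens get -100 so the loss ignores them.
        labels.append(-100 if wid is None or wid == prev else example["ner_tags"][wid])
        prev = wid
    enc["labels"] = labels
    return enc

train_ner = raw_dataset["train"].map(tokenize_and_align)
valid_ner = raw_dataset["validation"].map(tokenize_and_align)

trainer = Trainer(
    model=ner_model,
    args=TrainingArguments(output_dir="./ner", max_steps=10, per_device_train_batch_size=4),
    train_dataset=train_ner,
    eval_dataset=valid_ner,
    data_collator=DataCollatorForTokenClassification(ner_tokenizer),
)
trainer.train()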
How do I munge my DatasetDict so that it fits what I need?!
Alright, alright. Here goes...
First, I guess you would need to clarify in your question what task you are tackling, on top of what model and dataset you're using.
From your code, I am guessing you are trying to build a model for sentence simplification, i.e.
[in]: This is super long sentence that has lots of no meaning words.
[out]: This is a long-winded sentence.
using AutoModelForSeq2SeqLM("flax-community/t5-large-wikisplit"), but with the dxiao/requirements-ner-id data, i.e. something like
[in]: ['The', 'operating','humidity','shall','be',...,]
[out]: 'The humidity is high'
where only the tokens from dxiao/requirements-ner-id are used as input texts and everything else in the dataset is not needed, and (following your "simplefiedSentence" column) the target is the same sentence with some tokens masked out:
[in]: ['The', 'operating','humidity','shall','be',...,]
[out]: ['The', 'XXXXX', 'humidity', ...]
To build the input_ids and labels (that the model expects), I've written a random_xxx function for this purpose.

import random

def random_xxx(tokens):
    # Pick out 3 token positions to replace with '<xxx>'.
    to_xxx = set(random.sample(range(len(tokens)), 3))
    masked = []
    for i, tok in enumerate(tokens):
        if i in to_xxx:
            masked.append('<xxx>')
        else:
            masked.append(tok)
    return masked
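A quick sanity check of the helper (my addition; the exact positions vary because random.sample is, well, random):

print(random_xxx(['The', 'operating', 'humidity', 'shall', 'be', 'between', '0.4', 'and', '0.6']))
# e.g. ['The', '<xxx>', 'humidity', 'shall', '<xxx>', 'between', '0.4', '<xxx>', '0.6']

With that in place, the full training script looks like this: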
from itertools import chain
import random
import os
os.environ["WANDB_DISABLED"] = "true"
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset
import evaluate
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def random_xxx(tokens):
    # Pick out 3 token positions to replace with '<xxx>'.
    to_xxx = set(random.sample(range(len(tokens)), 3))
    masked = []
    for i, tok in enumerate(tokens):
        if i in to_xxx:
            masked.append('<xxx>')
        else:
            masked.append(tok)
    return masked
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Put '<xxx>' into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': ['<xxx>']})
# Not sure it's strictly needed, but resize the embeddings so the new token has an entry.
model.resize_token_embeddings(len(tokenizer))
# Assuming `input_ids` is "complex" original sentence.
# and `labels` is "simplified" sentence with XXX
train_data = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer(" ".join(x['tokens']),
         max_length=40, truncation=True, padding="max_length")["input_ids"],
     'labels': tokenizer(" ".join(random_xxx(x['tokens'])),
         max_length=40, truncation=True, padding="max_length")["input_ids"]}
)
valid_data = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer(" ".join(x['tokens']),
         max_length=40, truncation=True, padding="max_length")["input_ids"],
     'labels': tokenizer(" ".join(random_xxx(x['tokens'])),
         max_length=40, truncation=True, padding="max_length")["input_ids"]}
)
# set special tokens, not sure if it's needed but adding them for sanity...
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
mt_metrics = evaluate.combine(
    ["bleu", "chrf"], force_prefix=True
)

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    outputs = mt_metrics.compute(predictions=predictions,
                                 references=references)
    return outputs
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=1,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to="none",
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=valid_data.with_format("torch"),
    compute_metrics=compute_metrics
)
trainer.train()
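And once it's trained, a small usage sketch (again mine, not from the original answer) to generate a "simplified" sentence from a new requirement:

sample = "The operating humidity shall be between 0.4 and 0.6"
inputs = tokenizer(sample, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_length=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))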
Here are a few other tutorials that I find helpful: