I want to train the "flax-community/t5-large-wikisplit" model with the "dxiao/requirements-ner-id" dataset (just for some experiments).
I think my general procedure is not correct, but I don't know how to go further.
My Code:
Load tokenizer and model:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
checkpoint = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint).cuda()
Load dataset that I want to train:
from datasets import load_dataset
raw_dataset = load_dataset("dxiao/requirements-ner-id")
The raw_dataset has the columns ['id', 'tokens', 'tags', 'ner_tags'].
I want to get the sentences as whole sentences and not as token lists.
def tokenToString(tokenarray):
    string = tokenarray[0]
    for x in tokenarray[1:]:
        string += " " + x
    return string

def sentence_function(example):
    return {"sentence": tokenToString(example["tokens"]),
            "simplefiedSentence": tokenToString(example["tokens"]).replace("The", "XXXXXXXXXXX")}
wikisplit_req_set = raw_dataset.map(sentence_function)
wikisplit_req_set
I tried to restructure the dataset such that it looks like the wikisplit dataset:
simple1dataset = wikisplit_req_set.remove_columns(['id', 'tags', 'ner_tags', 'tokens']);
complexdataset = wikisplit_req_set.remove_columns(['id', 'tags', 'ner_tags', 'tokens']);
complexdataset["train"] = complexdataset["train"].add_column("simple_sentence_1",simple1dataset["train"]["sentence"]).add_column("simple_sentence_2",simple1dataset["train"]["simplefiedSentence"])
complexdataset["test"] = complexdataset["test"].add_column("simple_sentence_1",simple1dataset["test"]["sentence"]).add_column("simple_sentence_2",simple1dataset["test"]["simplefiedSentence"])
complexdataset["validation"] = complexdataset["validation"].add_column("simple_sentence_1",simple1dataset["validation"]["sentence"]).add_column("simple_sentence_2",simple1dataset["validation"]["simplefiedSentence"])
trainingDataSet = complexdataset.rename_column("sentence", "complex_sentence")
trainingDataSet
Tokenize it:
def tokenize_function(example):
    model_inputs = tokenizer(example["complex_sentence"], truncation=True, padding=True)
    targetS1 = tokenizer(example["simple_sentence_1"], truncation=True, padding=True)
    targetS2 = tokenizer(example["simple_sentence_2"], truncation=True, padding=True)
    model_inputs['simple_sentence_1'] = targetS1['input_ids']
    model_inputs['simple_sentence_2'] = targetS2['input_ids']
    model_inputs['decoder_input_ids'] = targetS2['input_ids']
    return model_inputs
tokenized_datasets = trainingDataSet.map(tokenize_function, batched=True)
tokenized_datasets=tokenized_datasets.remove_columns("complex_sentence")
tokenized_datasets=tokenized_datasets.remove_columns("simple_sentence_1")
tokenized_datasets=tokenized_datasets.remove_columns("simple_sentence_2")
tokenized_datasets=tokenized_datasets.remove_columns("simplefiedSentence")
tokenized_datasets
DataLoader:
from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
data_collator
Training:
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments, TrainingArguments, EvalPrediction, DataCollatorWithPadding, Trainer
import evaluate

bleu = evaluate.load("bleu")
training_args = Seq2SeqTrainingArguments(
    output_dir = "/",
    log_level = "error",
    num_train_epochs = 0.25,
    learning_rate = 5e-4,
    lr_scheduler_type = "linear",
    warmup_steps = 50,
    optim = "adafactor",
    weight_decay = 0.01,
    per_device_train_batch_size = 1,
    per_device_eval_batch_size = 1,
    gradient_accumulation_steps = 16,
    evaluation_strategy = "steps",
    eval_steps = 50,
    predict_with_generate=True,
    generation_max_length = 128,
    save_steps = 500,
    logging_steps = 10,
    push_to_hub = False,
    auto_find_batch_size=True
)
trainer = Seq2SeqTrainer(
    model,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=bleu,
)
trainer.train()
The problem is that I do not understand how the model knows the expected value and how it calculates its loss. Can someone give me some ideas about what happens where?
I hope someone can help me understand my own code, because the Hugging Face documentation does not help me enough. Maybe someone has some code examples or something else. I do not completely understand how I fine-tune the model and how I get the parameters the model expects for training. I also do not understand how the training works and what the parameters do.
Take some time to go through https://huggingface.co/course/ or read https://www.oreilly.com/library/view/natural-language-processing/9781098136789/
After that, you will have answered most of the questions you're having.
Show me the code: scroll down to the bottom of the answer =)
What does the model expect from the datasets.Dataset and datasets.DatasetDict?
TL;DR: basically, we want to be able to loop through the dataset and get back a dictionary whose keys are the names of the tensors that the model will consume, and whose values are the actual tensors that the model uses in its .forward() function.
In code, you want the processed dataset to be able to do this:
from datasets import load_dataset

ds = load_dataset(...)
ds = ds.map(func_to_preprocess)  # .map returns a new dataset; it does not modify ds in place

for data in ds:
    model(**data)  # Does a forward propagation pass.
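(As an aside to the loss question in the post, here is a minimal sketch of my own, not part of the original code: when you pass labels to a seq2seq model, it builds the decoder inputs by shifting the labels to the right and returns a cross-entropy loss over the target tokens; the Trainer calls exactly this under the hood.)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# A toy input/target pair; the target sentence is just an illustration.
enc = tokenizer("The operating humidity shall be between 0.4 and 0.6", return_tensors="pt")
labels = tokenizer("The humidity shall be between 0.4 and 0.6", return_tensors="pt").input_ids

out = model(input_ids=enc.input_ids, attention_mask=enc.attention_mask, labels=labels)
print(out.loss)          # cross-entropy between predicted and target tokens
print(out.logits.shape)  # (batch_size, target_length, vocab_size)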
Why can't I just feed the Dataset into the model directly? It's because the individual datasets' creators/maintainers are not necessarily the ones that create the models.
Keeping them independent makes sense, since a dataset can be used by different models, and each model requires the dataset to be preprocessed/"munged"/"manipulated" into the format that it expects (kind of like the Extract, Transform, Load (ETL) process for transformers-based models).
Unless explicitly preprocessed, most datasets are in raw text (str) and annotation/label format, which usually fall into one of these types:
- Text classification (e.g. language identification): [in]: Hallo Welt and [out]: de, with AutoModelForSequenceClassification
- Regression (e.g. sentence-pair scoring): [in]: Hello world <sep> Foo bar and [out]: 32.12, with AutoModelForSequenceClassification
- Sequence-to-sequence (e.g. translation): [in]: Hallo Welt and [out]: Hello World, with AutoModelForSeq2SeqLM
- Token classification (e.g. NER): [in]: Obama is the president and [out]: ['B-PER', 'O', 'O', 'O'], with AutoModelForTokenClassification
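To make that mapping concrete, here's a small sketch of my own (the "bert-base-cased" checkpoint is just a placeholder, not something from the question): each Auto* class puts a different head on top of a pretrained backbone, which is why each task expects different tensors.

from transformers import (AutoModelForSequenceClassification,
                          AutoModelForTokenClassification,
                          AutoModelForSeq2SeqLM)

# Sentence-level head: expects integer `labels` of shape (batch_size,).
classifier = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=2)
# Token-level head: expects `labels` of shape (batch_size, seq_len), one tag id per token.
tagger = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=9)
# Encoder-decoder: expects `labels` that are the token ids of the target text.
seq2seq = AutoModelForSeq2SeqLM.from_pretrained("flax-community/t5-large-wikisplit")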
For the dataset you're interested in:
from datasets import load_dataset
raw_dataset = load_dataset("dxiao/requirements-ner-id")
raw_dataset['train'][0]
[out]:
{'id': 0,
 'tokens': ['The', 'operating', 'humidity', 'shall', 'be', 'between', '0.4', 'and', '0.6'],
 'tags': ['O', 'B-ATTR', 'I-ATTR', 'O', 'B-ACT', 'B-RELOP', 'B-QUANT', 'O', 'B-QUANT'],
 'ner_tags': [0, 3, 4, 0, 1, 5, 7, 0, 7]}
But the model doesn't understand these inputs and outputs; it only understands torch.tensor objects, hence you need to do some processing.
Normally, a model's tokenizer converts raw strings into a list of token ids:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer(["hello world", "foo bar is a sentence", "fizz buzz"])
[out]:
{'input_ids': [[21820, 296, 1], [5575, 32, 1207, 19, 3, 9, 7142, 1], [361, 5271, 15886, 1]], 'attention_mask': [[1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1]]}
But if you feed the dataset's pre-tokenized words into convert_tokens_to_ids directly:

sentences = [
['The', 'operating','humidity','shall','be','between','0.4','and','0.6'],
['The', 'CIS', 'CNET', 'shall', 'accommodate', 'a', 'bandwidth', 'of', 'at', 'least', '24.0575', 'Gbps', 'to', 'the', 'Computer', 'Room', '.']
]
[tokenizer.convert_tokens_to_ids(sent) for sent in sentences]
[out]:
[[634, 2, 2, 2, 346, 24829, 22776, 232, 22787],
[634, 21134, 2, 2, 2, 9, 2, 858, 144, 2, 2, 2, 235, 532, 2, 2, 5]]
Why are there so many 2s? Because they are unknowns. If we take a look at the vocab:
>>> tokenizer.convert_tokens_to_ids(tokenizer.unk_token)
2
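(For comparison, an aside of my own: if you join the words back into a string and let the tokenizer do the splitting, it falls back to smaller subword pieces it knows instead of mapping whole unseen words to the unknown id; that's what the training code further down does with " ".join(x['tokens']).)

text = " ".join(['The', 'operating', 'humidity', 'shall', 'be', 'between', '0.4', 'and', '0.6'])
ids = tokenizer(text)["input_ids"]
# Converting back shows the known subword pieces the string was split into.
print(tokenizer.convert_ids_to_tokens(ids))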
Since the tokenizer doesn't know the NER tags either, one option is to add them to the tokenizer as additional special tokens, so that they get ids of their own, and then map the raw columns into the input_ids and labels the model expects. Here's an example:
from itertools import chain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from datasets import load_dataset
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Get the NER tags.
tag_set = list(map(str, set(chain(*raw_dataset['train']['tags']))))
# Put them into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': tag_set})
train_dataset = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
valid_dataset = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
TL;DR:
from itertools import chain
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from datasets import load_dataset
import evaluate
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Get the NER tags.
tag_set = list(map(str, set(chain(*raw_dataset['train']['tags']))))
# Put them into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': tag_set})
# Not sure it's strictly needed, but resize the embeddings so the newly added tag tokens have entries.
model.resize_token_embeddings(len(tokenizer))
train_data = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
valid_data = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer.convert_tokens_to_ids(x['tokens']),
     'labels': tokenizer.convert_tokens_to_ids(x['tags'])}
)
# set special tokens, not sure if it's needed but adding them for sanity...
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
mt_metrics = evaluate.combine(
    ["bleu", "chrf"], force_prefix=True
)

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    # -100 marks positions ignored by the loss; swap them for the pad id before decoding.
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    outputs = mt_metrics.compute(predictions=predictions,
                                 references=references)
    return outputs
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=1,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to="none",   # the string "none" actually disables the logging integrations
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=valid_data.with_format("torch"),
    # The examples above are not padded, so pad input_ids/labels per batch.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    compute_metrics=compute_metrics
)
trainer.train()
Hey, something seems fishy; when we train an NER model, shouldn't we be using AutoModelForTokenClassification, not AutoModelForSeq2SeqLM?
Yeah, but like many things in life, there are many means to the same end. So in this case, you can take the liberty to be creative, e.g. treat the tag sequence as the target of a sequence-to-sequence model, as the code above does.
I guess you don't really want to do NER anyway, but the lessons learnt from munging the corpus with additional tokens and the .map function should help with what you need.
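If you did want to go the "proper" NER route, here's a rough sketch (mine, not from the original answer; "bert-base-cased" is only a placeholder encoder, and the label alignment follows the usual "first sub-token gets the tag, the rest get -100" convention):

from itertools import chain
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

raw_dataset = load_dataset("dxiao/requirements-ner-id")
num_labels = len(set(chain(*raw_dataset['train']['ner_tags'])))

ner_checkpoint = "bert-base-cased"  # placeholder encoder, not the T5 checkpoint above
ner_tokenizer = AutoTokenizer.from_pretrained(ner_checkpoint)
ner_model = AutoModelForTokenClassification.from_pretrained(ner_checkpoint, num_labels=num_labels)

def tokenize_and_align(example):
    enc = ner_tokenizer(example["tokens"], is_split_into_words=True, truncation=True)
    labels, prev = [], None
    for wid in enc.word_ids():
        # Special tokens and continuation sub-tokens get -100 so the loss ignores them.
        labels.append(-100 if wid is None or wid == prev else example["ner_tags"][wid])
        prev = wid
    enc["labels"] = labels
    return enc

train_ner = raw_dataset["train"].map(tokenize_and_align)
valid_ner = raw_dataset["validation"].map(tokenize_and_align)

trainer = Trainer(
    model=ner_model,
    args=TrainingArguments(output_dir="./ner", max_steps=10, per_device_train_batch_size=4),
    train_dataset=train_ner,
    eval_dataset=valid_ner,
    data_collator=DataCollatorForTokenClassification(ner_tokenizer),
)
trainer.train()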
How do I munge my DatasetDict so that it fits what I need?!
Alright, alright. Here goes...
First, I guess you would need to clarify in your question what task you are tackling, on top of what model and dataset you're using.
From your code, I am guessing you are trying to build a model for sentence simplification, i.e.
[in]: This is super long sentence that has lots of no meaning words.
[out]: This is a long-winded sentence.
using AutoModelForSeq2SeqLM("flax-community/t5-large-wikisplit"), but with the dxiao/requirements-ner-id data, i.e. something like
[in]: ['The', 'operating','humidity','shall','be',...,]
[out]: 'The humidity is high'
where only the tokens from dxiao/requirements-ner-id are used as input texts and everything else in the dataset is not needed, and (following your "simplefiedSentence" column) the target is the same sentence with some tokens masked out:
[in]: ['The', 'operating','humidity','shall','be',...,]
[out]: ['The', 'XXXXX', 'humidity', ...]
To build the input_ids and labels (that the model expects), I've written a random_xxx function for this purpose.

import random

def random_xxx(tokens):
    # Pick out 3 token positions to replace with '<xxx>'.
    to_xxx = set(random.sample(range(len(tokens)), 3))
    masked = []
    for i, tok in enumerate(tokens):
        if i in to_xxx:
            masked.append('<xxx>')
        else:
            masked.append(tok)
    return masked
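A quick sanity check of the helper (my addition; the exact positions vary because random.sample is, well, random):

print(random_xxx(['The', 'operating', 'humidity', 'shall', 'be', 'between', '0.4', 'and', '0.6']))
# e.g. ['The', '<xxx>', 'humidity', 'shall', '<xxx>', 'between', '0.4', '<xxx>', '0.6']

With that in place, the full training script looks like this: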
from itertools import chain
import random
import os
os.environ["WANDB_DISABLED"] = "true"
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModel
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from datasets import load_dataset
import evaluate
model_name = "flax-community/t5-large-wikisplit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
def random_xxx(tokens):
    # Pick out 3 token positions to replace with '<xxx>'.
    to_xxx = set(random.sample(range(len(tokens)), 3))
    masked = []
    for i, tok in enumerate(tokens):
        if i in to_xxx:
            masked.append('<xxx>')
        else:
            masked.append(tok)
    return masked
raw_dataset = load_dataset("dxiao/requirements-ner-id")
# Put '<xxx>' into the tokenizer.
tokenizer.add_special_tokens({'additional_special_tokens': ['<xxx>']})
# Not sure it's strictly needed, but resize the embeddings so the new token has an entry.
model.resize_token_embeddings(len(tokenizer))
# Assuming `input_ids` is "complex" original sentence.
# and `labels` is "simplified" sentence with XXX
train_data = raw_dataset['train'].map(lambda x:
    {'input_ids': tokenizer(" ".join(x['tokens']),
         max_length=40, truncation=True, padding="max_length")["input_ids"],
     'labels': tokenizer(" ".join(random_xxx(x['tokens'])),
         max_length=40, truncation=True, padding="max_length")["input_ids"]}
)
valid_data = raw_dataset['validation'].map(lambda x:
    {'input_ids': tokenizer(" ".join(x['tokens']),
         max_length=40, truncation=True, padding="max_length")["input_ids"],
     'labels': tokenizer(" ".join(random_xxx(x['tokens'])),
         max_length=40, truncation=True, padding="max_length")["input_ids"]}
)
# set special tokens, not sure if it's needed but adding them for sanity...
model.config.eos_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
mt_metrics = evaluate.combine(
    ["bleu", "chrf"], force_prefix=True
)

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions
    predictions = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    references = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)
    outputs = mt_metrics.compute(predictions=predictions,
                                 references=references)
    return outputs
training_args = Seq2SeqTrainingArguments(
    output_dir='./',
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    logging_steps=1,
    save_steps=5,
    eval_steps=1,
    max_steps=10,
    evaluation_strategy="steps",
    predict_with_generate=True,
    report_to="none",
    metric_for_best_model="chr_f_score",
    load_best_model_at_end=True
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data.with_format("torch"),
    eval_dataset=valid_data.with_format("torch"),
    compute_metrics=compute_metrics
)
trainer.train()
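And once it's trained, a small usage sketch (again mine, not from the original answer) to generate a "simplified" sentence from a new requirement:

sample = "The operating humidity shall be between 0.4 and 0.6"
inputs = tokenizer(sample, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_length=40)
print(tokenizer.decode(generated[0], skip_special_tokens=True))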
Here are a few other tutorials that I find helpful: