I'm trying to use the SimpleTransformers default setup to do multi-task learning.
I am using the example from their website here.
The code looks like this:
import logging
import pandas as pd
from simpletransformers.t5 import T5Model, T5Args
logging.basicConfig(level=logging.INFO)
transformers_logger = logging.getLogger("transformers")
transformers_logger.setLevel(logging.WARNING)
train_data = [
["binary classification", "Anakin was Luke's father" , 1],
["binary classification", "Luke was a Sith Lord" , 0],
["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
["generate question", "Anakin was Luke's father" , "Who was Luke's father?"],
]
train_df = pd.DataFrame(train_data)
train_df.columns = ["prefix", "input_text", "target_text"]
eval_data = [
["binary classification", "Leia was Luke's sister" , 1],
["binary classification", "Han was a Sith Lord" , 0],
["generate question", "In 2020, the Star Wars franchise's total value was estimated at US$70 billion, and it is currently the fifth-highest-grossing media franchise of all time.", "What is the total value of the Star Wars franchise?"],
["generate question", "Leia was Luke's sister" , "Who was Luke's sister?"],
]
eval_df = pd.DataFrame(eval_data)
eval_df.columns = ["prefix", "input_text", "target_text"]
model_args = T5Args()
model_args.num_train_epochs = 200
model_args.no_save = True
model_args.evaluate_generated_text = False
model_args.evaluate_during_training = False
model_args.evaluate_during_training_verbose = False
model_args.use_multiprocessing = False
model_args.use_multiprocessing_for_evaluation = False
model = T5Model("t5", "t5-base", args=model_args)
def count_matches(labels, preds):
    print(labels)
    print(preds)
    return sum([1 if label == pred else 0 for label, pred in zip(labels, preds)])
model.train_model(train_df, show_running_loss=True)
I'm not even using eval_df at the moment (though I plan on using it in my real code) because it wasn't set up properly in their code. In this super simple setup I would think that the library would just work. However, after trying on two systems (one Windows, one Linux, both with the latest version of SimpleTransformers), I get the following error:
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\site-packages\simpletransformers\t5\t5_utils.py", line 175, in <listcomp>
preprocess_data(d) for d in tqdm(data, disable=args.silent)
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\site-packages\simpletransformers\t5\t5_utils.py", line 81, in preprocess_data
batch = tokenizer.prepare_seq2seq_batch(
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\tokenization_utils_base.py", line 3282, in prepare_seq2seq_batch
labels = self(
File "C:\Users\name\AppData\Local\Programs\Python\Python38\lib\site-packages\transformers\tokenization_utils_base.py", line 2262, in __call__
raise ValueError(
ValueError: text input must of type `str` (single example), `List[str]` (batch or single pretokenized example) or `List[List[str]]` (batch of pretokenized examples).
I'm using the exact setup, and all of the input DataFrames have strings in them.
Can anyone help figure out why this basic setup fails? Thanks.
In the example code, if you change
train_data = [
["binary classification", "Anakin was Luke's father" , 1],
["binary classification", "Luke was a Sith Lord" , 0],
["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
["generate question", "Anakin was Luke's father" , "Who was Luke's father?"],
]
to
train_data = [
["binary classification", "Anakin was Luke's father" , '1'],
["binary classification", "Luke was a Sith Lord" , '0'],
["generate question", "Star Wars is an American epic space-opera media franchise created by George Lucas, which began with the eponymous 1977 film and quickly became a worldwide pop-culture phenomenon", "Who created the Star Wars franchise?"],
["generate question", "Anakin was Luke's father" , "Who was Luke's father?"],
]
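the error no longer happens, so it is caused by the labels not being of type str. The tokenizer only accepts strings, and the integer labels 1 and 0 in the target_text column are what trigger the ValueError.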
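If the labels come from somewhere you can't easily edit by hand, a small check before training can catch this. The following is only a sketch: find_non_str_rows and stringify_rows are hypothetical helper names I made up for illustration, not part of SimpleTransformers, and it assumes the rows use the same [prefix, input_text, target_text] layout as the question.

```python
# Rows in the same [prefix, input_text, target_text] layout as the question.
train_data = [
    ["binary classification", "Anakin was Luke's father", 1],
    ["binary classification", "Luke was a Sith Lord", 0],
    ["generate question", "Anakin was Luke's father", "Who was Luke's father?"],
]


def find_non_str_rows(rows):
    """Hypothetical helper: return indices of rows containing a non-str field."""
    return [
        i for i, row in enumerate(rows)
        if not all(isinstance(value, str) for value in row)
    ]


def stringify_rows(rows):
    """Hypothetical helper: return a copy of rows with every field cast to str."""
    return [[str(value) for value in row] for row in rows]


bad_rows = find_non_str_rows(train_data)   # rows 0 and 1 hold integer labels
clean_data = stringify_rows(train_data)    # labels become "1" and "0"

print(bad_rows)
print(clean_data[0])
```

If the DataFrame is already built, train_df["target_text"] = train_df["target_text"].astype(str) should have the same effect on that column, though I have only reasoned about it against this small example.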