The function preprocess_function below was written for Hugging Face datasets.
from datasets import load_dataset, load_metric
from transformers import AutoTokenizer
raw_datasets = load_dataset("xsum")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
max_input_length = 1024
max_target_length = 128
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
I'm not using a Hugging Face dataset but my own, so I can't use the dataset.map() call
on the last line. Since my train dataset is just a plain pandas DataFrame, I changed the
last line to use apply instead:
tokenized_datasets = train.apply(preprocess_function)
But it raises this error:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-18-ad0e3caaca6d> in <module>()
----> 1 tokenized_datasets = train.apply(preprocess_function)
7 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
386 except ValueError as err:
387 raise KeyError(key) from err
--> 388 raise KeyError(key)
389 return super().get_loc(key, method=method, tolerance=tolerance)
390
KeyError: 'input'
Can someone tell me how to change this code to go from the raw train DataFrame to a tokenized_train dataset?
You can create a Hugging Face dataset from a pandas DataFrame with Dataset.from_pandas:
ds = Dataset.from_pandas(df)
should work. This will let you use the dataset's map feature.
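To see why map works where DataFrame.apply does not: Dataset.map(..., batched=True) passes preprocess_function a dict mapping column names to lists of values, which is exactly what expressions like examples["document"] expect. DataFrame.apply instead passes one Series per column (or per row), so the column lookup fails with a KeyError. Here is a minimal sketch of that calling convention, using a hypothetical fake_tokenizer stand-in so it runs without downloading a model:

```python
import pandas as pd

# Hypothetical stand-in for the real tokenizer, for illustration only:
# like tokenizer(...), it returns a dict containing "input_ids".
def fake_tokenizer(texts, max_length=8):
    return {"input_ids": [[ord(c) for c in t][:max_length] for t in texts]}

def preprocess_function(examples):
    # `examples` is a dict of column-name -> list of values,
    # the same shape Dataset.map(..., batched=True) passes in.
    inputs = ["summarize: " + doc for doc in examples["document"]]
    model_inputs = fake_tokenizer(inputs)
    labels = fake_tokenizer(examples["summary"])
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

train = pd.DataFrame({"document": ["a long article"], "summary": ["short"]})

# Toy equivalent of Dataset.from_pandas(train).map(preprocess_function, batched=True):
# hand over the whole frame as a dict of column lists, not row/column Series.
tokenized = preprocess_function(train.to_dict(orient="list"))
```

With the real libraries, Dataset.from_pandas(train).map(preprocess_function, batched=True) does this batching for you.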