
how to change function for huggingface datasets to custom dataset


The function `preprocess_function` below is written for Hugging Face datasets.

from datasets import load_dataset, load_metric
from transformers import AutoTokenizer

raw_datasets = load_dataset("xsum")
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

max_input_length = 1024
max_target_length = 128

if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    # Setup the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

I'm not using a Hugging Face dataset but my own data, so I can't call `dataset.map()` in the last line of the code. Since my training set is just a plain pandas DataFrame, I changed the last line to use `apply` instead:

tokenized_datasets = train.apply(preprocess_function)

But it raises this error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-18-ad0e3caaca6d> in <module>()
----> 1 tokenized_datasets = train.apply(preprocess_function)

7 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    386                 except ValueError as err:
    387                     raise KeyError(key) from err
--> 388             raise KeyError(key)
    389         return super().get_loc(key, method=method, tolerance=tolerance)
    390 

KeyError: 'input'

Can someone tell me how to change this code so it goes from the raw `train` DataFrame to a tokenized training dataset?


Solution

  • You can use a Hugging Face dataset by loading it from a pandas DataFrame, as shown in the documentation for Dataset.from_pandas. `ds = Dataset.from_pandas(df)` should work. This will let you use the dataset's `map` feature.