Tags: huggingface-tokenizers, huggingface-datasets, huggingface

How to use a dataset with a custom function?


I want to call the DatasetDict map function with extra parameters, and I don't know how to do it.

I have a function with the following API:

def tokenize_function(tokenizer, examples):
    s1 = examples["premise"]
    s2 = examples["hypothesis"]
    args = (s1, s2)
    return tokenizer(*args, padding="max_length", truncation=True)

And when I try to use it this way:

dataset            = load_dataset("json", data_files=data_files)
tokenizer          = AutoTokenizer.from_pretrained(model_name)
tokenized_datasets = dataset.map(tokenize_function, tokenizer, batched=True)

I'm getting this error:

TypeError: list indices must be integers or slices, not str

How can I call the map function in my example?


Solution

  • Additional parameters, such as the tokenizer object, need to be passed via the fn_kwargs argument of the .map function. Note also that the extra parameter must come after examples in the function signature, since .map always passes the batch as the first positional argument:

    from datasets import load_dataset
    from transformers import RobertaTokenizer
    
    dataset = load_dataset("anli")
    t = RobertaTokenizer.from_pretrained("roberta-base")
    
    def tokenize_function(examples, tokenizer):
        s1 = examples["premise"]
        s2 = examples["hypothesis"]
        args = (s1, s2)
        return tokenizer(*args, padding="max_length", truncation=True)
    
    
    tokenized_datasets = dataset.map(tokenize_function, batched=True, fn_kwargs={"tokenizer": t})
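
  • An alternative to fn_kwargs is to bind the tokenizer with functools.partial before handing the function to .map. Here is a minimal, self-contained sketch of the pattern; fake_tokenizer is a stand-in I made up so the example runs without downloading a real model — with transformers you would bind the actual tokenizer object instead:

    ```python
    from functools import partial

    # Stand-in for a real tokenizer (hypothetical, for illustration only):
    # accepts two lists of strings and keyword args, like a HF tokenizer call.
    def fake_tokenizer(s1, s2, padding=None, truncation=False):
        # Return one "token count" per sentence pair.
        return {"input_ids": [len(a) + len(b) for a, b in zip(s1, s2)]}

    def tokenize_function(examples, tokenizer):
        s1 = examples["premise"]
        s2 = examples["hypothesis"]
        return tokenizer(s1, s2, padding="max_length", truncation=True)

    # partial() fixes the tokenizer argument, leaving a one-argument function
    # that .map can call with just the batch:
    #   tokenized_datasets = dataset.map(partial(tokenize_function, tokenizer=t), batched=True)
    fn = partial(tokenize_function, tokenizer=fake_tokenizer)
    batch = {"premise": ["a cat"], "hypothesis": ["an animal"]}
    out = fn(batch)
    ```

    Both approaches are equivalent for this use case; fn_kwargs keeps the extra arguments visible at the .map call site, while partial keeps the .map call itself minimal.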