Tags: huggingface-tokenizers, huggingface-datasets, huggingface

How to use a dataset with a custom function?


I want to call the DatasetDict map function with extra parameters, and I don't know how to do it.

I have a function with the following API:

def tokenize_function(tokenizer, examples):
    s1 = examples["premise"]
    s2 = examples["hypothesis"]
    args = (s1, s2)
    return tokenizer(*args, padding="max_length", truncation=True)

And when I try to use it this way:

dataset            = load_dataset("json", data_files=data_files)
tokenizer          = AutoTokenizer.from_pretrained(model_name)
tokenized_datasets = dataset.map(tokenize_function, tokenizer, batched=True)

I'm getting this error:

TypeError: list indices must be integers or slices, not str

How can I call the map function in my example?


Solution

  • Additional parameters, such as the tokenizer object, need to be passed via the fn_kwargs argument of the .map function. Note also that the extra parameter must come after examples in the function signature, since .map always passes the batch as the first positional argument:

    from datasets import load_dataset
    from transformers import RobertaTokenizer
    
    dataset = load_dataset("anli")
    t = RobertaTokenizer.from_pretrained("roberta-base")
    
    def tokenize_function(examples, tokenizer):
        s1 = examples["premise"]
        s2 = examples["hypothesis"]
        args = (s1, s2)
        return tokenizer(*args, padding="max_length", truncation=True)
    
    
    tokenized_datasets = dataset.map(tokenize_function, batched=True, fn_kwargs={"tokenizer": t})
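
  • An alternative to fn_kwargs is to bind the tokenizer with functools.partial before handing the function to .map. Here is a minimal, self-contained sketch of the pattern; fake_tokenizer is a stand-in I made up so the example runs without downloading a real model — with transformers you would bind the actual tokenizer object instead:

    ```python
    from functools import partial

    # Stand-in for a real tokenizer (hypothetical, for illustration only):
    # accepts two lists of strings and keyword args, like a HF tokenizer call.
    def fake_tokenizer(s1, s2, padding=None, truncation=False):
        # Return one "token count" per sentence pair.
        return {"input_ids": [len(a) + len(b) for a, b in zip(s1, s2)]}

    def tokenize_function(examples, tokenizer):
        s1 = examples["premise"]
        s2 = examples["hypothesis"]
        return tokenizer(s1, s2, padding="max_length", truncation=True)

    # partial() fixes the tokenizer argument, leaving a one-argument function
    # that .map can call with just the batch:
    #   tokenized_datasets = dataset.map(partial(tokenize_function, tokenizer=t), batched=True)
    fn = partial(tokenize_function, tokenizer=fake_tokenizer)
    batch = {"premise": ["a cat"], "hypothesis": ["an animal"]}
    out = fn(batch)
    ```

    Both approaches are equivalent for this use case; fn_kwargs keeps the extra arguments visible at the .map call site, while partial keeps the .map call itself minimal.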