I want to call the DatasetDict map function with extra parameters, and I don't know how to do it.
I have a function with the following signature:
def tokenize_function(tokenizer, examples):
    s1 = examples["premise"]
    s2 = examples["hypothesis"]
    args = (s1, s2)
    return tokenizer(*args, padding="max_length", truncation=True)
When I try to use it this way:
dataset = load_dataset("json", data_files=data_files)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenized_datasets = dataset.map(tokenize_function, tokenizer, batched=True)
I get this error:
TypeError: list indices must be integers or slices, not str
How can I call the map function in my example?
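In your call, the tokenizer is passed positionally, so it lands in the second parameter of .map (with_indices) instead of reaching tokenize_function. The function is then called with the batch and a list of indices, and indexing that list with "premise" raises the TypeError. Additional arguments, such as the tokenizer object, must be passed through the fn_kwargs parameter of the .map function. The batch is always the first positional argument, so examples has to come first in the function signature: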
from datasets import load_dataset
from transformers import RobertaTokenizer
dataset = load_dataset("anli")
t = RobertaTokenizer.from_pretrained("roberta-base")
def tokenize_function(examples, tokenizer):
    s1 = examples["premise"]
    s2 = examples["hypothesis"]
    args = (s1, s2)
    return tokenizer(*args, padding="max_length", truncation=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True, fn_kwargs={"tokenizer":t})
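An alternative to fn_kwargs is to bind the tokenizer up front with functools.partial, so that the function handed to .map takes only the batch. The sketch below is a minimal illustration under some assumptions: ToyTokenizer is a hypothetical stand-in (not part of transformers) so the example runs without downloading a model, and the batch is a plain dict shaped like what .map passes when batched=True.

```python
from functools import partial


class ToyTokenizer:
    """Hypothetical stand-in for a Hugging Face tokenizer (not a real API)."""

    def __call__(self, first, second, padding=None, truncation=False):
        # Join each premise/hypothesis pair and split on whitespace.
        # padding and truncation are accepted but ignored in this toy version.
        return {"tokens": [f"{a} {b}".split() for a, b in zip(first, second)]}


def tokenize_function(examples, tokenizer):
    # `examples` is the batch: a mapping from column name to a list of values.
    return tokenizer(
        examples["premise"],
        examples["hypothesis"],
        padding="max_length",
        truncation=True,
    )


# Bind the tokenizer once; the resulting callable takes only the batch.
bound_tokenize = partial(tokenize_function, tokenizer=ToyTokenizer())

# A plain dict shaped like the batch that .map passes when batched=True.
batch = {"premise": ["A man sleeps."], "hypothesis": ["He is awake."]}
result = bound_tokenize(batch)
print(result)
```

With the real objects from the answer above, the equivalent call would be dataset.map(partial(tokenize_function, tokenizer=t), batched=True). Both approaches should give the same result; fn_kwargs keeps the extra arguments visible at the .map call, while partial is convenient when the same bound function is reused in several places.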