I'm running various code examples from the HuggingFace docs, and a variable examples
repeatedly appears in these tutorials.
Many functions have it as a formal parameter:
def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    ....
Yet when these functions are called, no input is passed in:
tokenized_squad = squad.map(preprocess_function, batched=True, ....)
This example is from the Question Answering tutorial.
What exactly is examples, and why doesn't it need to be passed in when the function is called?
To help you understand, we first have to clarify that map does not take the called function preprocess_function() as an argument, but the function itself, preprocess_function. So basically what map does is take the dataset (squad) and apply the given function (preprocess_function) to it.
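The difference between passing a function and calling it can be shown with a minimal sketch. The helper apply_to and the function shout are hypothetical names made up for illustration; they are not part of any library. apply_to only plays the role that map plays above.

```python
def shout(text):
    # An ordinary function with one formal parameter.
    return text.upper()


def apply_to(data, function):
    # Hypothetical stand-in for map: it receives the function object
    # and decides for itself when to call it and with which argument.
    return function(data)


# Pass the function itself (no parentheses). apply_to supplies the argument.
result = apply_to("hello", shout)
```

Writing apply_to("hello", shout()) instead would try to call shout right away with no argument and fail, which is why the function is handed over uncalled.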
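This means that map will process the whole dataset squad in chunks (batches) of 1000 items at a time, which is the default batch size when batched=True. For every batch, map calls the function and passes the batch in as examples, so you never have to pass it yourself. A batch holds the values of each column as a list, which is why examples["question"] is a list of questions. Very simplified, under the hood, map does something like this: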
Split squad into sublists (batches of 1000 examples each), call preprocess_function on each batch, and collect the results.
As a reference point, it might be useful for you to read up on the built-in map in Python, which works similarly on iterables. It is also a concept that recurs in other programming languages, so it is good to know!
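To make the batching idea concrete, here is a rough sketch. It is an illustration only, not the actual implementation in the datasets library: simple_batched_map and toy_squad are hypothetical names, the dataset is modeled as a plain list of dicts, and the batch size is set to 2 so the effect is visible on three rows.

```python
def preprocess_function(examples):
    # `examples` is one batch: a dict mapping column names to lists of values.
    return {"question": [q.strip() for q in examples["question"]]}


def simple_batched_map(rows, function, batch_size=1000):
    # Hypothetical, heavily simplified stand-in for map(..., batched=True).
    output = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Turn the chunk (a list of rows) into a dict of columns.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        # This call is where the parameter `examples` gets its value.
        result = function(batch)
        # Turn the returned dict of columns back into rows.
        columns = list(result)
        for values in zip(*(result[c] for c in columns)):
            output.append(dict(zip(columns, values)))
    return output


toy_squad = [{"question": " Who? "}, {"question": "What?  "}, {"question": " Why?"}]
cleaned = simple_batched_map(toy_squad, preprocess_function, batch_size=2)

# The built-in map follows the same idea, one item at a time.
stripped = list(map(str.strip, [" a ", "b "]))
```

In both cases the caller hands over the function, and the mapping code supplies the argument each time it calls it.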