Tags: nlp, huggingface-transformers, nlp-question-answering

Is 'examples' a default output variable for the HuggingFace transformers library?


I'm running various code examples from the HuggingFace docs, and a variable named examples repeatedly appears in these tutorials. Many functions take it as a formal parameter:

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    ....

Yet when these functions are called, no input is passed in:

tokenized_squad = squad.map(preprocess_function, batched=True, ....)

This example is from the Question Answering tutorial.

What exactly is examples, and why doesn't it need to be passed in when the function is called?


Solution

  • To understand this, first note that map does not receive the result of calling preprocess_function() as an argument. It receives the function object preprocess_function itself (note the missing parentheses), and map is the one that calls it, supplying the argument each time. So basically what map does is:

    1. For a given dataset (here squad)
    2. Execute a function preprocess_function
    3. On every batch (batched=True with a default batch size of 1000)

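    This means map processes the whole dataset squad in chunks (batches) of 1000 items at a time and calls the function once per batch. Each batch is passed in as the function's first argument, which is what the parameter examples refers to. With batched=True it is a dictionary-like object that maps each column name to a list of values, so examples["question"] is a list of questions. The name examples is only a convention; any parameter name would work. Very simplified, under the hood, map does something like this: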

    1. Chop squad into batches (of up to 1000 examples each)
    2. Process each batch, i.e. call the function with that batch as its argument
    3. Put the results of the function back into the resulting dataset.
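
    The three steps above can be sketched in plain Python. This is only a toy illustration, not the library's actual implementation: simple_batched_map and toy_squad are made-up names, the dataset is assumed to be a plain dict of column name to list, and unlike the real map this sketch keeps only the columns the function returns.

```python
def simple_batched_map(columns, function, batch_size=1000):
    """Toy stand-in for a batched map over a column-oriented dataset.

    `columns` is a dict mapping column names to equal-length lists.
    Hypothetical helper for illustration only, not the library's code.
    """
    num_rows = len(next(iter(columns.values()), []))
    result = {}
    for start in range(0, num_rows, batch_size):
        # 1. Chop the dataset into a batch: a dict of column -> list slice.
        batch = {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }
        # 2. Call the function; this is where `examples` gets its value.
        output = function(batch)
        # 3. Put the returned columns back into the resulting dataset.
        for name, values in output.items():
            result.setdefault(name, []).extend(values)
    return result


def preprocess_function(examples):
    # `examples` is whatever batch the map helper passes in.
    return {"question": [q.strip() for q in examples["question"]]}


# Made-up data standing in for the real dataset.
toy_squad = {"question": ["  Who?  ", " What? ", "Where?   "]}

# The function is passed without parentheses; the helper calls it per batch.
cleaned = simple_batched_map(toy_squad, preprocess_function, batch_size=2)
print(cleaned)  # {'question': ['Who?', 'What?', 'Where?']}
```

    With batch_size=2 and three rows, preprocess_function is called twice: once with two questions and once with the remaining one. The caller never supplies examples directly.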

    As a reference point, it might be useful to read up on Python's built-in map, which works similarly on iterables: you hand it a function, and it calls that function on each item for you. The same concept appears in many other programming languages, so it is good to know!
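
    A minimal example of the built-in map shows the same pattern on a small scale. The names strip_question and questions are made up for illustration.

```python
def strip_question(question):
    # `question` is supplied by map(), one item at a time.
    return question.strip()


questions = ["  Who?  ", " What? "]

# Pass the function object itself (no parentheses); map() calls it per item.
stripped = list(map(strip_question, questions))
print(stripped)  # ['Who?', 'What?']
```

    Writing map(strip_question(), questions) instead would try to call strip_question immediately with no argument and raise a TypeError, which is the difference between passing a function and passing the result of calling it.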