
How to understand the answer_start parameter of Squad dataset for training BERT-QA model + practical implications for creating custom dataset?


I am in the process of creating a custom dataset to benchmark the accuracy of the 'bert-large-uncased-whole-word-masking-finetuned-squad' model for my domain, to understand if I need to fine-tune further, etc.

When looking at the different Question Answering datasets on the Hugging Face site (squad, adversarial_qa, etc.), I see that the answers are commonly formatted as a dictionary with the keys text (the answer text) and answer_start (the character index where the answer starts).

I'm trying to understand:

  • The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
  • If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
  • If so, is there a programmatic way to do this to avoid manual effort?

Any help or direction would be greatly appreciated!

Code example to show format:

import datasets

# Load SQuAD and inspect the 'answers' field of the first training example
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example:\n')
print(train[0]['answers'])  # dict with keys 'text' and 'answer_start'

Solution

  • Your question is a bit broad for a single specific answer, but I will try my best to point you in the right direction.

    The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.

    There are different types of QA tasks/datasets. The ones you mentioned (SQuAD and adversarial_qa) belong to the field of extractive question answering. There, a model must select a span from a given context that answers the given question. For example:

    context = 'Second, Democrats have always elevated their minority floor leader to the speakership upon reclaiming majority status. Republicans have not always followed this leadership succession pattern. In 1919, for instance, Republicans bypassed James R. Mann, R-IL, who had been minority leader for eight years, and elected Frederick Gillett, R-MA, to be Speaker. Mann "had angered many Republicans by objecting to their private bills on the floor;" also he was a protégé of autocratic Speaker Joseph Cannon, R-IL (1903–1911), and many Members "suspected that he would try to re-centralize power in his hands if elected Speaker." More recently, although Robert H. Michel was the Minority Leader in 1994 when the Republicans regained control of the House in the 1994 midterm elections, he had already announced his retirement and had little or no involvement in the campaign, including the Contract with America which was unveiled six weeks before voting day.'
    question = 'How did Republicans feel about Mann in 1919?'
    answer = 'angered'  # -> starting at character 365
    

    A simple approach that is often used today is a linear layer that predicts the answer start and answer end from the last hidden state of a transformer encoder (code example). The last hidden state holds one vector for each input token (tokens are not the same as words, since a word can be split into several tokens) and the linear layer is trained to assign high probabilities to tokens that could potentially be the start and end of the answer span. To train a model with your data, the loss function needs to know which tokens should get a high probability (i.e. the start and end tokens of the answer). This is exactly what answer_start provides: together with the length of the answer text, it pins down those token positions.
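
    As a rough sketch of how answer_start feeds into training (assuming a "fast" Hugging Face tokenizer, which provides offset mappings; the toy context and variable names below are just for illustration), the character index is converted into the token-level start/end positions that the cross-entropy loss of the QA head is computed against:

    # Sketch: turn the character-level answer_start into token-level labels.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        'bert-large-uncased-whole-word-masking-finetuned-squad')

    context = 'Republicans were angered by Mann in 1919.'  # toy context
    question = 'How did Republicans feel about Mann?'
    answer_text = 'angered'
    answer_start = context.find(answer_text)    # char index, as stored in SQuAD
    answer_end = answer_start + len(answer_text)

    enc = tokenizer(question, context, return_offsets_mapping=True, truncation=True)
    sequence_ids = enc.sequence_ids()           # None = special, 0 = question, 1 = context

    start_token = end_token = None
    for i, (char_s, char_e) in enumerate(enc['offset_mapping']):
        if sequence_ids[i] != 1:                # skip question and special tokens
            continue
        if char_s <= answer_start < char_e:
            start_token = i
        if char_s < answer_end <= char_e:
            end_token = i

    # These two indices are what a QA model is trained against (passed as
    # start_positions / end_positions), so the loss can push up the start/end
    # logits of exactly these tokens.
    print(start_token, end_token)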

    If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)

    You should go through this process, otherwise how should anyone know where the answer starts in your context? They could of course infer it programmatically, but what if your answer string appears twice in the context? Providing an answer start position avoids ambiguity and allows your users to use the dataset right away with one of the many extractive question answering scripts that are already available out there.

    If so, is there a programmatic way to do this to avoid manual effort?

    You could simply loop through your dataset and use str.find:

    context.find(answer)
    

    Output:

    365
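
    For example, a minimal sketch of such a loop (the structure of my_examples is a placeholder; adapt the field names to whatever your custom dataset uses):

    # Sketch: add answer_start to a custom dataset programmatically.
    my_examples = [
        {'context': 'Republicans were angered by Mann in 1919.',
         'question': 'How did Republicans feel about Mann?',
         'answer': 'angered'},
    ]

    for ex in my_examples:
        start = ex['context'].find(ex['answer'])
        if start == -1:
            # The answer does not appear verbatim in the context -> review manually.
            print('No match:', ex['question'])
            continue
        if ex['context'].find(ex['answer'], start + 1) != -1:
            # The answer appears more than once -> pick the right occurrence manually.
            print('Ambiguous match:', ex['question'])
        # Store it in the SQuAD-style format so existing evaluation scripts work.
        ex['answers'] = {'text': [ex['answer']], 'answer_start': [start]}

    The two printed warnings cover the cases where str.find alone is not enough: a missing or an ambiguous match still needs a quick manual check.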