Tags: python, nlp, dataset, large-language-model, bert-language-model

How to convert character indices to BERT token indices


I am working with the question-answering dataset UCLNLP/adversarial_qa.

from datasets import load_dataset
ds = load_dataset("UCLNLP/adversarial_qa", "adversarialQA")

How do I map character-based answer indices to token-based indices after tokenizing the question and context together with a tokenizer such as BERT's? Here's an example row from my dataset:

d0 = ds['train'][0]
d0

{'id': '7ba1e8f4261d3170fcf42e84a81dd749116fae95',
 'title': 'Brain',
 'context': 'Another approach to brain function is to examine the consequences of damage to specific brain areas. Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid, and isolated from the bloodstream by the blood–brain barrier, the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage. In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function. Because there is no ability to experimentally control the nature of the damage, however, this information is often difficult to interpret. In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.',
 'question': 'What sare the benifts of the blood brain barrir?',
 'answers': {'text': ['isolated from the bloodstream'], 'answer_start': [195]},
 'metadata': {'split': 'train', 'model_in_the_loop': 'Combined'}}

After tokenization, the answer spans tokens 56 through 60 (the slice end, 61, is exclusive):

from transformers import BertTokenizerFast
bert_tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased', return_token_type_ids=True)

bert_tokenizer.decode(bert_tokenizer.encode(d0['question'], d0['context'])[56:61])
'isolated from the bloodstream'

I want to create a new dataset with the answer's token indices, e.g., 56 and 60.

This is from a LinkedIn Learning class. The instructor did the conversion and created the CSV file, but he did not share the file or the code that produced it. The expected result is a QA dataset with token-level answer indices.


Solution

  • Encode the question and context together with offset mapping enabled, find the tokens whose character offsets cover the start and end of the answer, and store those token-level indices in the dataset.

    The following function does the above for you:

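    def get_token_indices(example):
        # Tokenize question and context together; `return_offsets_mapping=True`
        # returns the (start, end) character span of every token
        encoded = tokenizer(
            example['question'],
            example['context'],
            return_offsets_mapping=True
        )
        # Per token: 0 = question, 1 = context, None = special token
        sequence_ids = encoded.sequence_ids()

        # Character span of the answer within the context (end is exclusive)
        char_start = example['answers']['answer_start'][0]
        char_end = char_start + len(example['answers']['text'][0])

        # Identify token indices for the answer (both inclusive)
        start_token_idx = None
        end_token_idx = None

        for i, (start, end) in enumerate(encoded['offset_mapping']):
            # Offsets restart at 0 for each sequence, so only context tokens
            # may be compared against the answer's character positions
            if sequence_ids[i] != 1:
                continue
            if start <= char_start < end:
                start_token_idx = i
            if start < char_end <= end:
                end_token_idx = i
                break

        # Both stay None if no context token covers the answer span
        example['answer_start_token_idx'] = start_token_idx
        example['answer_end_token_idx'] = end_token_idx
        return example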
    

    Here's how you can use and test this function:

    from datasets import load_dataset
    from transformers import BertTokenizerFast

    ds = load_dataset("UCLNLP/adversarial_qa", "adversarialQA")
    tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased', return_token_type_ids=True)

    tokenized_ds = ds['train'].map(get_token_indices)

    # Example
    d0_tokenized = tokenized_ds[0]
    print("Tokenized start index:", d0_tokenized['answer_start_token_idx'])
    print("Tokenized end index:", d0_tokenized['answer_end_token_idx'])

    # Decode the inclusive token span to check that it reproduces the answer text
    input_ids = tokenizer.encode(d0_tokenized['question'], d0_tokenized['context'])
    start = d0_tokenized['answer_start_token_idx']
    end = d0_tokenized['answer_end_token_idx']
    answer_tokens = tokenizer.decode(input_ids[start:end + 1])
    print("Tokenized answer:", answer_tokens)
    

    Output:

    Tokenized start index: 56
    Tokenized end index: 60
    Tokenized answer: isolated from the bloodstream
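
To see the offset matching in isolation, here is a minimal library-free sketch. It uses a toy regex tokenizer (`toy_encode`, a hypothetical stand-in that is not BERT's WordPiece) and a helper named `char_span_to_token_span`, both invented for illustration. It assumes the pair layout `[CLS] question [SEP] context [SEP]` and that character offsets restart at 0 for each sequence, which is how Hugging Face fast tokenizers report them. The example shows that a match which ignores which sequence a token belongs to can land in the question when the answer sits near the start of the context.

```python
import re


def toy_encode(question, context):
    """Toy stand-in for a fast tokenizer (assumption: no subword splitting).

    Returns (tokens, offsets, sequence_ids) laid out as
    [CLS] question [SEP] context [SEP].
    """
    tokens, offsets, seq_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        for m in re.finditer(r"\w+|[^\w\s]", text):
            tokens.append(m.group())
            offsets.append((m.start(), m.end()))  # offsets restart at 0 per sequence
            seq_ids.append(seq_id)
        tokens.append("[SEP]")
        offsets.append((0, 0))
        seq_ids.append(None)
    return tokens, offsets, seq_ids


def char_span_to_token_span(offsets, seq_ids, char_start, char_end, context_only=True):
    """Map a character span (exclusive end) to inclusive token indices."""
    start_idx = end_idx = None
    for i, (start, end) in enumerate(offsets):
        if context_only and seq_ids[i] != 1:
            continue  # skip question tokens and special tokens
        if start <= char_start < end:
            start_idx = i
        if start < char_end <= end:
            end_idx = i
            break
    return start_idx, end_idx


question = "Who wrote it?"
context = "Ada Lovelace wrote it."
answer = "Ada"
char_start = context.index(answer)
char_end = char_start + len(answer)

tokens, offsets, seq_ids = toy_encode(question, context)

# Without the sequence filter, the question token "Who" (chars 0-3) matches first
naive = char_span_to_token_span(offsets, seq_ids, char_start, char_end, context_only=False)
# With the filter, only context tokens are considered
span = char_span_to_token_span(offsets, seq_ids, char_start, char_end)

# Sanity check: slice the context with the matched tokens' offsets
recovered = context[offsets[span[0]][0]:offsets[span[1]][1]]

print("naive:", naive, tokens[naive[0]:naive[1] + 1])
print("filtered:", span, tokens[span[0]:span[1] + 1])
print("recovered:", recovered)
```

Slicing the context with the matched offsets, as in the last step, is a cheap way to validate a mapped span without decoding token ids; with a real subword tokenizer the recovered text may extend slightly past the answer if the answer ends in the middle of a token.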