Tags: python, mapping, spacy, tokenize, bert-language-model

Map BERT token indices to Spacy token indices


I’m trying to map BERT’s (bert-base-uncased) token indices (not ids, token indices) to Spacy’s token indices. In the following example my approach doesn’t work, because Spacy’s tokenization is a bit more complex than I anticipated. Any thoughts on solving this?

import spacy
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nlp = spacy.load("en_core_web_sm")

sent = nlp("BRITAIN'S railways cost £20.7bn during the 2020-21 financial year, with £2.5bn generated through fares and other income, £1.3bn through other sources and £16.9bn from government, figures released by the regulator the Office of Rail and Road (ORR) on November 30 revealed.")
# For each BERT token, record the index of the Spacy word it came from
wd_to_tok_map = [wd.i for wd in sent for el in tokenizer.encode(wd.text, add_special_tokens=False)]
len(sent) # 55
len(wd_to_tok_map) # 67     <- Should be 65

input_ids = tokenizer.encode(sent.text, add_special_tokens=False)
len(input_ids) # 65

I could print both tokenizations and look for exact text matches, but the problem I run into is: what if a word appears more than once in the sentence? Searching for that word then returns indices at different positions.

[el.text for el in sent]
['BRITAIN', "'S", 'railways', 'cost', '£', '20.7bn', 'during', 'the', '2020', '-', '21', 'financial', 'year', ',', 'with', '£', '2.5bn', 'generated', 'through', 'fares', 'and', 'other', 'income', ',', '£', '1.3bn', 'through', 'other', 'sources', 'and', '£', '16.9bn', 'from', 'government', ',', 'figures', 'released', 'by', 'the', 'regulator', 'the', 'Office', 'of', 'Rail', 'and', 'Road', '(', 'ORR', ')', 'on', 'November', '30', 'revealed', '.']

[tokenizer.ids_to_tokens[el] for el in input_ids]
['britain', "'", 's', 'railways', 'cost', '£2', '##0', '.', '7', '##bn', 'during', 'the', '2020', '-', '21', 'financial', 'year', ',', 'with', '£2', '.', '5', '##bn', 'generated', 'through', 'fares', 'and', 'other', 'income', ',', '£1', '.', '3', '##bn', 'through', 'other', 'sources', 'and', '£1', '##6', '.', '9', '##bn', 'from', 'government', ',', 'figures', 'released', 'by', 'the', 'regulator', 'the', 'office', 'of', 'rail', 'and', 'road', '(', 'orr', ')', 'on', 'november', '30', 'revealed', '.']

decode() doesn’t seem to give me what I want, as I’m after the indices.


Solution

  • Use a fast tokenizer to get the character offsets directly from the transformer tokenizer with return_offsets_mapping=True, and then map those to the spacy tokens however you'd like (one way to do that last step is sketched in the second bullet below):

    from transformers import AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    text = "BRITAIN'S railways cost £20.7bn"
    output = tokenizer([text], return_offsets_mapping=True)
    
    print(output["input_ids"])
    # [[101, 3725, 1005, 1055, 7111, 3465, 21853, 2692, 1012, 1021, 24700, 102]]
    
    print(tokenizer.convert_ids_to_tokens(output["input_ids"][0]))
    # ['[CLS]', 'britain', "'", 's', 'railways', 'cost', '£2', '##0', '.', '7', '##bn', '[SEP]']
    
    print(output["offset_mapping"])
    # [[(0, 0), (0, 7), (7, 8), (8, 9), (10, 18), (19, 23), (24, 26), (26, 27), (27, 28), (28, 29), (29, 31), (0, 0)]]
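
  • To go from those character offsets to spacy token indices, one option is to compare each offset against the character spans of the spacy tokens, which spacy exposes via token.idx. This is a minimal sketch of that idea, not part of either library's API: the loop, the variable names, and the "assign to the token containing the first character" rule are illustrative choices.

    import spacy
    from transformers import AutoTokenizer

    nlp = spacy.load("en_core_web_sm")
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    text = "BRITAIN'S railways cost £20.7bn"
    doc = nlp(text)
    # add_special_tokens=False so every offset pair belongs to a real BERT token
    output = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)

    # Character span (start, end) of every spacy token
    spacy_spans = [(tok.idx, tok.idx + len(tok.text)) for tok in doc]

    # Assign each BERT token to the spacy token that contains its first character
    bert_to_spacy = []
    for start, end in output["offset_mapping"]:
        spacy_i = next(i for i, (s, e) in enumerate(spacy_spans) if s <= start < e)
        bert_to_spacy.append(spacy_i)

    print(bert_to_spacy)
    # [0, 1, 1, 2, 3, 4, 5, 5, 5, 5]  <- one spacy token index per BERT token

    Note that a wordpiece like '£2' straddles the boundary between spacy's '£' and '20.7bn' tokens, so any alignment rule (first character, majority overlap, etc.) involves a choice about where such pieces should land; the sketch above assigns '£2' to '£' because that token contains its first character.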