
How to convert tokenized words back to the original ones after inference?


I'm writing an inference script for an already trained NER model, but I'm having trouble converting the encoded tokens (their ids) back into the original words.

# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks out there!']})

# calling method that handles inference:
ner_model = NER()
ner_model.recognize_from_df(df, 'body')

# here is only part of larger NER class that handles the inference:
def recognize_from_df(self, df: pd.DataFrame, input_col: str):
    predictions = []
    df = df[['_id', input_col]].copy()
    dataset = Dataset.from_pandas(df)
    # tokenization, padding, truncation:
    encoded_dataset = dataset.map(lambda examples: self.bert_tokenizer(examples[input_col], 
                                      padding='max_length', truncation=True, max_length=512), batched=True)
    encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'], device=device)
    dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=32)
    encoded_dataset_ids = encoded_dataset['_id']

    for batch in dataloader:
        with torch.no_grad():  # inference only, no gradients needed
            output = self.model(**batch)
        # per-token label ids: argmax over the token-classification logits
        label_ids = output.logits.argmax(dim=-1)
        # decoding predictions and tokens
        for i in range(batch['input_ids'].shape[0]):
            tags = [self.unique_labels[label_id] for label_id in label_ids[i].tolist()]
            tokens = [t for t in self.bert_tokenizer.convert_ids_to_tokens(batch['input_ids'][i]) if t != '[PAD]']
        ...

The results are close to what I need:

# tokens:
['[CLS]', 'am', '##az', '##on', 'and', 'te', '##sla', 'are', 'currently', 'the', 'best', 'picks', 'out', 'there', ...]
# tags:
['X', 'B-COMPANY', 'X', 'X', 'O', 'B-COMPANY', 'X', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ...]

How can I combine 'am', '##az', '##on' and 'B-COMPANY', 'X', 'X' into a single token/tag pair? I know the tokenizer has a convert_tokens_to_string method, but it returns just one big string, which is hard to map back to the tags.
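(Aside, an assumption beyond what the question states: if self.bert_tokenizer is a *fast* tokenizer, the encoding it returns has a word_ids() method that maps every token to the index of the word it came from, which makes regrouping subwords straightforward. A sketch with a hand-written word_ids list standing in for the real tokenizer output:)

```python
from itertools import groupby

# word_ids as a fast tokenizer's BatchEncoding.word_ids() would return them:
# None for special tokens, otherwise the index of the originating word
tokens   = ['[CLS]', 'am', '##az', '##on', 'and', 'te', '##sla']
word_ids = [None,    0,    0,      0,      1,     2,    2]

words = []
for wid, group in groupby(zip(tokens, word_ids), key=lambda pair: pair[1]):
    if wid is None:
        continue  # skip special tokens such as [CLS]
    # join the subwords of one word, dropping the "##" continuation prefix
    words.append("".join(t.lstrip("#") for t, _ in group))

print(words)  # ['amazon', 'and', 'tesla']
```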

Regards


Solution

  • Provided you only want to "merge" company names, this can be done in linear time with pure Python.

    Skipping the beginning of sentence token [CLS] for brevity:

    tokens = tokens[1:]
    tags = tags[1:]
    

    The function below merges company tokens in a single pass, advancing the index so no subword is visited twice:

    def merge_company(tokens, tags):
        generated_tokens = []
        i = 0
        while i < len(tags):
            if tags[i] == "B-COMPANY":
                company_token = [tokens[i]]
                i += 1
                # consume the "X" continuation subwords, stripping their "##" prefix
                while i < len(tags) and tags[i] == "X":
                    company_token.append(tokens[i][2:])
                    i += 1
                generated_tokens.append("".join(company_token))
            else:
                generated_tokens.append(tokens[i])
                i += 1
    
        return generated_tokens
    

    Usage is simple; note that the tags need their X entries removed as well:

    tokens = merge_company(tokens, tags)
    tags = [tag for tag in tags if tag != "X"]
    

    This would give you:

    ['amazon', 'and', 'tesla', 'are', 'currently', 'the', 'best', 'picks', 'out', 'there']
    ['B-COMPANY', 'O', 'B-COMPANY', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
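
    The two steps above (merging the tokens, then filtering the X tags) can also be folded into a single pass that returns the merged tokens and their tags together. A sketch, not part of the original answer:

```python
def merge_tokens_and_tags(tokens, tags):
    """Merge '##' subword continuations (tagged 'X') into the previous token."""
    out_tokens, out_tags = [], []
    for token, tag in zip(tokens, tags):
        if tag == "X" and out_tokens:
            # continuation of the previous word: drop the '##' prefix and append
            out_tokens[-1] += token.lstrip("#")
        else:
            out_tokens.append(token)
            out_tags.append(tag)
    return out_tokens, out_tags

tokens = ['am', '##az', '##on', 'and', 'te', '##sla', 'are']
tags = ['B-COMPANY', 'X', 'X', 'O', 'B-COMPANY', 'X', 'O']
print(merge_tokens_and_tags(tokens, tags))
# (['amazon', 'and', 'tesla', 'are'], ['B-COMPANY', 'O', 'B-COMPANY', 'O'])
```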