
How to convert tokenized words back to the original ones after inference?


I'm writing an inference script for an already trained NER model, but I'm having trouble converting the encoded tokens (their ids) back into the original words.

# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks out there!']})

# calling method that handles inference:
ner_model = NER()
ner_model.recognize_from_df(df, 'body')

# here is only part of larger NER class that handles the inference:
def recognize_from_df(self, df: pd.DataFrame, input_col: str):
    predictions = []
    df = df[['_id', input_col]].copy()
    dataset = Dataset.from_pandas(df)
    # tokenization, padding, truncation:
    encoded_dataset = dataset.map(lambda examples: self.bert_tokenizer(examples[input_col], 
                                      padding='max_length', truncation=True, max_length=512), batched=True)
    encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'], device=device)
    dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=32)
    encoded_dataset_ids = encoded_dataset['_id']

    for batch in dataloader:
        with torch.no_grad():  # inference only, no gradients needed
            output = self.model(**batch)
        # per-token label ids: argmax over the token-classification logits
        label_ids = output.logits.argmax(dim=-1)
        # decoding predictions and tokens
        for i in range(batch['input_ids'].shape[0]):
            tags = [self.unique_labels[label_id] for label_id in label_ids[i].tolist()]
            tokens = [t for t in self.bert_tokenizer.convert_ids_to_tokens(batch['input_ids'][i]) if t != '[PAD]']
        ...

The results are close to what I need:

# tokens:
['[CLS]', 'am', '##az', '##on', 'and', 'te', '##sla', 'are', 'currently', 'the', 'best', 'picks', 'out', 'there', ...]
# tags:
['X', 'B-COMPANY', 'X', 'X', 'O', 'B-COMPANY', 'X', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ...]

How can I combine 'am', '##az', '##on' and 'B-COMPANY', 'X', 'X' into a single token/tag pair? I know the tokenizer has a convert_tokens_to_string method, but it returns just one big string, which is hard to map back to the tags.
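(Aside, an assumption beyond what the question states: if self.bert_tokenizer is a *fast* tokenizer, the encoding it returns has a word_ids() method that maps every token to the index of the word it came from, which makes regrouping subwords straightforward. A sketch with a hand-written word_ids list standing in for the real tokenizer output:)

```python
from itertools import groupby

# word_ids as a fast tokenizer's BatchEncoding.word_ids() would return them:
# None for special tokens, otherwise the index of the originating word
tokens   = ['[CLS]', 'am', '##az', '##on', 'and', 'te', '##sla']
word_ids = [None,    0,    0,      0,      1,     2,    2]

words = []
for wid, group in groupby(zip(tokens, word_ids), key=lambda pair: pair[1]):
    if wid is None:
        continue  # skip special tokens such as [CLS]
    # join the subwords of one word, dropping the "##" continuation prefix
    words.append("".join(t.lstrip("#") for t, _ in group))

print(words)  # ['amazon', 'and', 'tesla']
```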

Regards


Solution

  • Provided you only want to "merge" company names, this can be done in linear time with pure Python.

    Skipping the beginning of sentence token [CLS] for brevity:

    tokens = tokens[1:]
    tags = tags[1:]
    

    The function below merges company tokens in a single pass, advancing the index so no subword is visited twice:

    def merge_company(tokens, tags):
        generated_tokens = []
        i = 0
        while i < len(tags):
            if tags[i] == "B-COMPANY":
                company_token = [tokens[i]]
                i += 1
                # consume the "X" continuation subwords, stripping their "##" prefix
                while i < len(tags) and tags[i] == "X":
                    company_token.append(tokens[i][2:])
                    i += 1
                generated_tokens.append("".join(company_token))
            else:
                generated_tokens.append(tokens[i])
                i += 1
    
        return generated_tokens
    

    Usage is simple; note that the tags need their X entries removed as well:

    tokens = merge_company(tokens, tags)
    tags = [tag for tag in tags if tag != "X"]
    

    This would give you:

    ['amazon', 'and', 'tesla', 'are', 'currently', 'the', 'best', 'picks', 'out', 'there']
    ['B-COMPANY', 'O', 'B-COMPANY', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
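
    The two steps above (merging the tokens, then filtering the X tags) can also be folded into a single pass that returns the merged tokens and their tags together. A sketch, not part of the original answer:

```python
def merge_tokens_and_tags(tokens, tags):
    """Merge '##' subword continuations (tagged 'X') into the previous token."""
    out_tokens, out_tags = [], []
    for token, tag in zip(tokens, tags):
        if tag == "X" and out_tokens:
            # continuation of the previous word: drop the '##' prefix and append
            out_tokens[-1] += token.lstrip("#")
        else:
            out_tokens.append(token)
            out_tags.append(tag)
    return out_tokens, out_tags

tokens = ['am', '##az', '##on', 'and', 'te', '##sla', 'are']
tags = ['B-COMPANY', 'X', 'X', 'O', 'B-COMPANY', 'X', 'O']
print(merge_tokens_and_tags(tokens, tags))
# (['amazon', 'and', 'tesla', 'are'], ['B-COMPANY', 'O', 'B-COMPANY', 'O'])
```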