I'm writing an inference script for an already trained NER model, but I'm having trouble converting the encoded tokens (their ids) back into the original words.
# example input
df = pd.DataFrame({'_id': [1], 'body': ['Amazon and Tesla are currently the best picks out there!']})
# calling the method that handles inference:
ner_model = NER()
ner_model.recognize_from_df(df, 'body')
# here is only the part of the larger NER class that handles inference:
def recognize_from_df(self, df: pd.DataFrame, input_col: str):
    predictions = []
    df = df[['_id', input_col]].copy()
    dataset = Dataset.from_pandas(df)
    # tokenization, padding, truncation:
    encoded_dataset = dataset.map(
        lambda examples: self.bert_tokenizer(examples[input_col],
                                             padding='max_length', truncation=True, max_length=512),
        batched=True)
    encoded_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'], device=device)
    dataloader = torch.utils.data.DataLoader(encoded_dataset, batch_size=32)
    encoded_dataset_ids = encoded_dataset['_id']
    for batch in dataloader:
        output = self.model(**batch)
        # decoding predictions and tokens
        for i in range(batch['input_ids'].shape[0]):
            tags = [self.unique_labels[label_id] for label_id in output[i]]
            tokens = [t for t in self.bert_tokenizer.convert_ids_to_tokens(batch['input_ids'][i]) if t != '[PAD]']
            ...
The results are close to what I need:
# tokens:
['[CLS]', 'am', '##az', '##on', 'and', 'te', '##sla', 'are', 'currently', 'the', 'best', 'picks', 'out', 'there', ...]
# tags:
['X', 'B-COMPANY', 'X', 'X', 'O', 'B-COMPANY', 'X', 'O', 'O', 'O', 'O', 'O', 'O', 'O', ...]
How can I combine 'am', '##az', '##on' and 'B-COMPANY', 'X', 'X' into a single token/tag pair? I know the tokenizer has a convert_tokens_to_string method, but it returns just one big string, which is hard to map back to the tags.
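For example, something like this just collapses the pieces back into plain text, so there is no way to tell which piece carried which tag:
# illustration only: the WordPiece pieces are glued back into plain text,
# losing the alignment with the per-token tags
self.bert_tokenizer.convert_tokens_to_string(['am', '##az', '##on', 'and', 'te', '##sla'])
# -> 'amazon and tesla'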
Regards
Provided you only want to "merge" company names, this can be done in linear time with pure Python.
Skipping the beginning-of-sentence token [CLS] for brevity:
tokens = tokens[1:]
tags = tags[1:]
The function below will merge company tokens and advance the pointer appropriately:
def merge_company(tokens, tags):
    generated_tokens = []
    i = 0
    while i < len(tags):
        if tags[i] == "B-COMPANY":
            company_token = [tokens[i]]
            # consume the following 'X' pieces, stripping their '##' prefix
            for j in range(i + 1, len(tags)):
                i += 1
                if tags[j] != "X":
                    break
                else:
                    company_token.append(tokens[j][2:])
            generated_tokens.append("".join(company_token))
        else:
            generated_tokens.append(tokens[i])
            i += 1
    return generated_tokens
Usage is pretty simple; note that the X entries need to be removed from tags as well though:
tokens = merge_company(tokens, tags)
tags = [tag for tag in tags if tag != "X"]
This would give you:
['amazon', 'and', 'tesla', 'are', 'currently', 'the', 'best', 'picks', 'out', 'there']
['B-COMPANY', 'O', 'B-COMPANY', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
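If you later want to merge the pieces for every word rather than just the company names, a minimal sketch of the same idea (using a hypothetical merge_subwords helper, written against the tag scheme shown above) could look like this:
def merge_subwords(tokens, tags):
    # glue every 'X'-tagged piece onto the preceding word,
    # keeping exactly one tag per merged word
    words, word_tags = [], []
    for token, tag in zip(tokens, tags):
        if tag == "X" and words:
            words[-1] += token[2:] if token.startswith("##") else token
        else:
            words.append(token)
            word_tags.append(tag)
    return words, word_tags
For the example above, merge_subwords(tokens, tags) returns the same two lists in a single pass.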