I am interested in using pre-trained models from Hugging Face for named entity recognition (NER) tasks without further training or testing of the model.
On the model's Hugging Face page, the only usage information given is the following:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
I tried the following code, but I am getting a tensor output instead of class labels for each named entity.
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
text = "my text for named entity recognition here."
# Encode the text to token ids and add a batch dimension
input_ids = torch.tensor(tokenizer.encode(text, padding=True, truncation=True, max_length=50, add_special_tokens=True)).unsqueeze(0)
with torch.no_grad():
    output = model(input_ids, output_attentions=True)
Any suggestions on how to apply the model to a text for NER?
In transformers, NER is done with the TokenClassificationPipeline:
from transformers import AutoTokenizer, pipeline, AutoModelForTokenClassification
tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModelForTokenClassification.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
nerpipeline = pipeline('ner', model=model, tokenizer=tokenizer)
text = "my text for named entity recognition here."
nerpipeline(text)
Output:
[{'word': 'my',
'score': 0.5209763050079346,
'entity': 'LABEL_0',
'index': 1,
'start': 0,
'end': 2},
{'word': 'text',
'score': 0.5161970257759094,
'entity': 'LABEL_0',
'index': 2,
'start': 3,
'end': 7},
{'word': 'for',
'score': 0.5297629237174988,
'entity': 'LABEL_1',
'index': 3,
'start': 8,
'end': 11},
{'word': 'named',
'score': 0.5258920788764954,
'entity': 'LABEL_1',
'index': 4,
'start': 12,
'end': 17},
{'word': 'entity',
'score': 0.5415489673614502,
'entity': 'LABEL_1',
'index': 5,
'start': 18,
'end': 24},
{'word': 'recognition',
'score': 0.5396601557731628,
'entity': 'LABEL_1',
'index': 6,
'start': 25,
'end': 36},
{'word': 'here',
'score': 0.5165827870368958,
'entity': 'LABEL_0',
'index': 7,
'start': 37,
'end': 41},
{'word': '.',
'score': 0.5266348123550415,
'entity': 'LABEL_0',
'index': 8,
'start': 41,
'end': 42}]
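The pipeline returns one dictionary per token. If spans are more useful than single tokens, consecutive tokens with the same label can be merged using the 'start' and 'end' character offsets. Below is a minimal sketch of that idea; merge_token_predictions is a hypothetical helper written for this answer, not part of transformers, and the token list is a trimmed copy of the output above. Recent transformers versions also accept an aggregation_strategy argument on the pipeline that serves a similar purpose.

```python
from typing import Dict, List


def merge_token_predictions(tokens: List[Dict], text: str) -> List[Dict]:
    """Merge consecutive tokens that share a label into one span.

    Hypothetical helper for illustration; it only relies on the
    'entity', 'start' and 'end' keys shown in the pipeline output.
    """
    spans: List[Dict] = []
    for tok in tokens:
        if spans and spans[-1]["entity"] == tok["entity"]:
            # Same label as the previous token: extend the current span
            spans[-1]["end"] = tok["end"]
        else:
            # Label changed (or first token): start a new span
            spans.append(
                {"entity": tok["entity"], "start": tok["start"], "end": tok["end"]}
            )
    # Recover the surface text of each span from the character offsets
    for span in spans:
        span["text"] = text[span["start"]:span["end"]]
    return spans


text = "my text for named entity recognition here."
# Trimmed copy of the pipeline output above (scores and indices omitted)
tokens = [
    {"word": "my", "entity": "LABEL_0", "start": 0, "end": 2},
    {"word": "text", "entity": "LABEL_0", "start": 3, "end": 7},
    {"word": "for", "entity": "LABEL_1", "start": 8, "end": 11},
    {"word": "named", "entity": "LABEL_1", "start": 12, "end": 17},
    {"word": "entity", "entity": "LABEL_1", "start": 18, "end": 24},
    {"word": "recognition", "entity": "LABEL_1", "start": 25, "end": 36},
    {"word": "here", "entity": "LABEL_0", "start": 37, "end": 41},
    {"word": ".", "entity": "LABEL_0", "start": 41, "end": 42},
]

spans = merge_token_predictions(tokens, text)
for span in spans:
    print(span["entity"], repr(span["text"]))
```

The sketch groups purely by identical label names, so it does not interpret B-/I- prefixes the way a full NER post-processor would.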
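Please note that you need to use AutoModelForTokenClassification instead of AutoModel, and that not all models have a trained head for token classification. For such a model the token classification head is initialized with random weights, so the predicted labels are not meaningful; the generic LABEL_0/LABEL_1 names and scores close to 0.5 in the output above are consistent with that :)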
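One rough way to spot a checkpoint without a trained token classification head is to look at its label names: when no labels are stored with the model, transformers falls back to placeholder names of the form LABEL_0, LABEL_1, and so on. The sketch below checks for that pattern. has_only_placeholder_labels is a hypothetical helper, and the two dictionaries are made-up examples rather than the contents of any real config.

```python
import re
from typing import Dict

# Placeholder names look like LABEL_0, LABEL_1, ...
_PLACEHOLDER = re.compile(r"^LABEL_\d+$")


def has_only_placeholder_labels(id2label: Dict[int, str]) -> bool:
    """Return True if every label name is a generic LABEL_<n> placeholder.

    Hypothetical helper for illustration. This is a heuristic: a model
    fine-tuned without custom label names would also return True.
    """
    return bool(id2label) and all(
        _PLACEHOLDER.match(name) is not None for name in id2label.values()
    )


# Made-up example mappings (assumptions, not taken from real checkpoints)
placeholder_labels = {0: "LABEL_0", 1: "LABEL_1"}
named_labels = {0: "O", 1: "B-PER", 2: "I-PER"}

print(has_only_placeholder_labels(placeholder_labels))  # True
print(has_only_placeholder_labels(named_labels))        # False
```

With transformers installed, the mapping for a real checkpoint should be available as AutoConfig.from_pretrained(name).id2label. transformers also logs a warning about newly initialized weights when a head is created from scratch, which is usually the more direct signal.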