Tags: python, nlp, format, bert-language-model, huggingface-transformers

What should properly formatted data for NER with BERT look like?


I am using Huggingface's transformers library and want to perform NER with BERT. I have tried to find an explicit example of how to properly format the data for NER with BERT, but it is not entirely clear to me from the paper or from the comments I have found.

Let's say we have the following sentence and labels:

sent = "John Johanson lives in Ramat Gan."
labels = ['B-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC']

Would the data we input to the model be something like the following?

sent = ['[CLS]', 'john', 'johan',  '##son', 'lives',  'in', 'ramat', 'gan', '.', '[SEP]']
labels = ['O', 'B-PER', 'I-PER', 'I-PER', 'O', 'O', 'B-LOC', 'I-LOC', 'O', 'O']
attention_mask = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # 1 for every real token, [CLS]/[SEP] included; 0 only for padding
token_type_ids = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # all 0 for a single-segment input


Thank you!


Solution

  • Update 2021-08-27: The link below points to a legacy tutorial, which I no longer fully recommend, since it does not use Huggingface's convenience library datasets.

    There is actually a great tutorial for the NER example on the huggingface documentation page. It also goes into detail on how the provided script does the preprocessing: specifically, it links to an external contributor's preprocess.py script, which converts data from the CoNLL 2003 format into whatever the huggingface library requires. I found this to be the easiest way to verify that I had the proper formatting, and unless you have specific changes you want to incorporate, it gets you started quickly without worrying about implementation details.
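
    For illustration, here is a minimal sketch of that subword alignment step using the transformers fast tokenizer. The sentence, the first-subword labeling convention, and the use of -100 as the ignore index are assumptions for the sketch (the question's scheme of propagating the I- label to continuation subwords like '##son' is an equally common alternative), so the linked preprocess.py may differ in detail:

    from transformers import BertTokenizerFast

    # Assumed word-level input and labels, mirroring the question's example
    words = ["John", "Johanson", "lives", "in", "Ramat", "Gan", "."]
    word_labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    encoding = tokenizer(words, is_split_into_words=True)

    # Align word-level labels to subword tokens: label only the first subword
    # of each word; mark continuation subwords and special tokens ([CLS], [SEP])
    # with -100 so the loss function ignores them. Labels are kept as strings
    # here for readability.
    aligned_labels = []
    previous_word_id = None
    for word_id in encoding.word_ids():
        if word_id is None:                  # special token like [CLS]/[SEP]
            aligned_labels.append(-100)
        elif word_id != previous_word_id:    # first subword of a word
            aligned_labels.append(word_labels[word_id])
        else:                                # continuation subword, e.g. '##son'
            aligned_labels.append(-100)
        previous_word_id = word_id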

    The linked example script also provides more than enough detail on how to feed the respective inputs into the model itself, but generally, you are correct about the input pattern shown above; a minimal sketch of the forward pass follows.
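
    To make that concrete, here is a self-contained sketch of a single forward pass with BertForTokenClassification. The label set and the checkpoint name are assumptions for the sketch, and a real setup would batch, pad, and train rather than score one sentence:

    import torch
    from transformers import BertTokenizerFast, BertForTokenClassification

    label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # assumed label set
    label2id = {label: i for i, label in enumerate(label_list)}

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForTokenClassification.from_pretrained(
        "bert-base-uncased", num_labels=len(label_list)
    )

    words = ["John", "Johanson", "lives", "in", "Ramat", "Gan", "."]
    word_labels = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC", "O"]
    encoding = tokenizer(words, is_split_into_words=True, return_tensors="pt")

    # Same first-subword alignment as above, this time mapped to integer ids
    label_ids, previous_word_id = [], None
    for word_id in encoding.word_ids():
        if word_id is None or word_id == previous_word_id:
            label_ids.append(-100)           # ignored by the loss
        else:
            label_ids.append(label2id[word_labels[word_id]])
        previous_word_id = word_id

    # encoding already contains input_ids, token_type_ids and attention_mask
    outputs = model(**encoding, labels=torch.tensor([label_ids]))
    print(outputs.loss)                      # cross-entropy over labeled positions only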