nlp training-data named-entity-recognition spacy

Spacy 2.0 NER Training

In spaCy v1 it was possible to train the NER model by providing a document and a list of entity annotations in BILOU format.

However, in v2 it seems that training is only possible by providing entity annotations like (7, 13, 'LOC'), i.e. entity offsets plus an entity tag.

Is the old way of providing the list of tokens and another list of entity tags in BILOU format still valid?

From what I gather from the documentation, the nlp.update method accepts a list of GoldParse objects, so I could create a GoldParse object for each doc and pass the BILOU tags to its entities attribute. However, would I lose important information by ignoring the other attributes of the GoldParse class (e.g. heads or tags, see https://spacy.io/api/goldparse), or are the other attributes not needed for training the NER?

Thanks!

Solution

Yes, you can still create GoldParse objects with the BILUO tags. The main reason the usage examples show the "simpler" offset format is that it makes them slightly easier to read and understand.
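To make the relationship between the two formats concrete, here is a minimal pure-Python sketch that turns offset annotations into one BILUO tag per token. The function name offsets_to_biluo and the whitespace tokenization are assumptions for illustration only. spaCy's own tokenizer splits text differently, so this shows the idea rather than what the library does internally.

```python
def offsets_to_biluo(text, entities):
    """Convert (start, end, label) offsets into one BILUO tag per token.

    Tokens are whitespace-separated here, which is a simplification:
    a real tokenizer also splits punctuation and contractions.
    """
    # Record every token together with its character span.
    tokens = []
    pos = 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append((word, start, end))
        pos = end

    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Indices of tokens that lie fully inside the entity span.
        inside = [i for i, (_, s, e) in enumerate(tokens)
                  if s >= ent_start and e <= ent_end]
        # Skip spans that do not line up exactly with token boundaries.
        if (not inside
                or tokens[inside[0]][1] != ent_start
                or tokens[inside[-1]][2] != ent_end):
            continue
        if len(inside) == 1:
            tags[inside[0]] = "U-" + label      # Unit: single-token entity
        else:
            tags[inside[0]] = "B-" + label      # Begin
            for i in inside[1:-1]:
                tags[i] = "I-" + label          # In
            tags[inside[-1]] = "L-" + label     # Last
    return [word for word, _, _ in tokens], tags


words, tags = offsets_to_biluo(
    "I like London and New York City",
    [(7, 13, "LOC"), (18, 31, "LOC")],
)
print(list(zip(words, tags)))
```

A tag list like this, with one tag per token of the doc, is what you would pass as the entities argument of GoldParse.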

If you only want to train the NER, you can now also use the nlp.disable_pipes() context manager and disable all other pipeline components (e.g. the 'tagger' and 'parser') during training. After the block, the components will be restored, so when you save out the model, it will include the whole pipeline. You can see this in action in the NER training examples.
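A rough sketch of what such a training loop could look like with BILUO tags, assuming spaCy v2.x (where spacy.gold.GoldParse exists) and a pipeline that already contains an 'ner' component. The function name train_ner_only, the toy TRAIN_DATA and the hyperparameter values are made up for illustration. The tag lists must match the tokens spaCy actually produces for each text.

```python
import random

# Hypothetical toy data: (text, one BILUO tag per spaCy token).
TRAIN_DATA = [
    ("I like London", ["O", "O", "U-LOC"]),
    ("She moved to New York City", ["O", "O", "O", "B-LOC", "I-LOC", "L-LOC"]),
]


def train_ner_only(nlp, examples, n_iter=10, drop=0.35):
    """Update only the 'ner' component of a spaCy v2 pipeline."""
    # Imported lazily: spacy.gold.GoldParse is part of the spaCy v2.x API.
    from spacy.gold import GoldParse

    examples = list(examples)  # copy, so shuffling leaves the caller's list alone

    # Register every entity label that occurs in the tags.
    ner = nlp.get_pipe("ner")
    for _, tags in examples:
        for tag in tags:
            if tag not in ("O", "-"):
                ner.add_label(tag.split("-", 1)[1])

    # Disable everything except the NER for the duration of the block.
    other_pipes = [name for name in nlp.pipe_names if name != "ner"]
    losses = {}
    with nlp.disable_pipes(*other_pipes):
        # Note: begin_training() initializes the weights of the enabled pipes.
        optimizer = nlp.begin_training()
        for _ in range(n_iter):
            random.shuffle(examples)
            losses = {}
            for text, tags in examples:
                doc = nlp.make_doc(text)
                # Only the entities are supplied; heads, tags etc. stay unset.
                gold = GoldParse(doc, entities=tags)
                nlp.update([doc], [gold], drop=drop, sgd=optimizer, losses=losses)
    # Outside the block the disabled components are active again.
    return losses
```

With a blank v2 model this might be called as train_ner_only(nlp, TRAIN_DATA) after nlp = spacy.blank("en") and nlp.add_pipe(nlp.create_pipe("ner")). For an exact, maintained version, see the NER training examples shipped with spaCy.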