
When manually tagging a corpus for NLP, is it important to have untagged text as well?


I am doing manual tagging to train my own NER model. Do I have to include the untagged text in the sentences I am preparing for named entity recognition?

<START:person> Olivier Grisel <END> is working on the <START:software> Stanbol <END> project .

Or can I omit untagged parts like this?

<START:person> Olivier Grisel <END>
<START:software> Stanbol <END>

PS: Thanks for all the great answers. I tried omitting the untagged parts and in that case OpenNLP marked every line as an entity, so it didn't work. As the answers explain, untagged parts are necessary.


Solution

  • If you are doing manual tagging to train your own NER model (it's not 100% clear from your question), you should include the same kind of data you expect to tag later, most likely full sentences. The default model features (see OpenNLP docs) include a window of tokens to the left and right of the token that's currently being considered, so you want your labeled entities to appear in their normal context. You also want your model to learn which words shouldn't be tagged as entities, so they also need to appear in context in your training data.

    See the related question: Open NLP Name Finder Training
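
To make the point about context concrete, here is a rough Python sketch (not OpenNLP's actual implementation) that parses the <START:type> ... <END> format from the question and shows what surrounds an entity token. The helper names parse_annotated and context_window are made up for illustration, and the window size of 2 is an assumption; check the OpenNLP docs for the features your model really uses.

```python
import re

# Matches an opening tag such as <START:person> and captures the entity type.
START = re.compile(r"<START:([^>]+)>")

def parse_annotated(line):
    """Split an annotated sentence into (token, label) pairs.

    Tokens outside any <START:type> ... <END> span get the label "O".
    """
    pairs = []
    current = "O"
    for tok in line.split():
        m = START.fullmatch(tok)
        if m:
            current = m.group(1)      # entering an entity span
        elif tok == "<END>":
            current = "O"             # leaving the entity span
        else:
            pairs.append((tok, current))
    return pairs

def context_window(pairs, index, size=2):
    """Return the tokens up to `size` positions left and right of pairs[index].

    size=2 is an assumed value for illustration only.
    """
    tokens = [tok for tok, _ in pairs]
    left = tokens[max(0, index - size):index]
    right = tokens[index + 1:index + 1 + size]
    return left, right

full = ("<START:person> Olivier Grisel <END> is working on the "
        "<START:software> Stanbol <END> project .")
bare = "<START:software> Stanbol <END>"

full_pairs = parse_annotated(full)
bare_pairs = parse_annotated(bare)

# In the full sentence, "Stanbol" (index 6) has neighbours on both sides.
print(context_window(full_pairs, 6))   # (['on', 'the'], ['project', '.'])

# In the bare line there is no context at all, and no "O" tokens to learn from.
print(context_window(bare_pairs, 0))   # ([], [])
```

With only bare lines, every token the model sees in training is part of an entity, which fits the behaviour described in the question's PS, where every line got marked as an entity.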