I am retraining the Stanford NER system to extract Technology names and Organization names from text.
If I want to retrain the Stanford NER model, should the training data be given in this format:
She O
works O
on O
C# TECHNOLOGY
at O
New ORGANIZATION
York ORGANIZATION
Times ORGANIZATION
and O
Microsoft ORGANIZATION
in O
New LOCATION
York LOCATION
Is it sufficient to just specify the named entities in this manner? Do we need to specify part-of-speech information in some format when we retrain a model? Also, if we have multi-word entities, is this the correct way to annotate them?
This is the approach I followed; is it right?
I used this command from the Stanford NER FAQ:
java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -prop austen.prop
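For completeness, the austen.prop file referenced by that command is a Java properties file along the lines of the one shown in the Stanford NER FAQ. A minimal sketch (the file names are placeholders, and the feature flags are one commonly used combination, not the only valid one):

```
# path to the training file (token<TAB>label format)
trainFile = ner-training.tsv
# where to save the trained classifier
serializeTo = ner-model.ser.gz

# column layout of the training file: column 0 is the word, column 1 is the label
map = word=0,answer=1

# a typical feature set
useClassFeature = true
useWord = true
useNGrams = true
noMidNGrams = true
maxNGramLeng = 6
usePrev = true
useNext = true
useSequences = true
usePrevSequences = true
maxLeft = 1
useTypeSeqs = true
useTypeSeqs2 = true
useTypeySequences = true
wordShape = chris2useLC
useDisjunctive = true
```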
Yes, it is sufficient to just annotate entities as shown. Stanford NER also supports training classifiers that use POS information at classification time, but it adds very little accuracy when other techniques like distributional-similarity word clusters are used. In the models we distribute, we don't use POS information, for simplicity (and so NER can be run alone, without the POS tagger).
For annotating multi-word entities, there are several strategies. The encoding above is sometimes called "IO" (inside/outside). It has the advantages of being simple to interpret and fast, but the disadvantage of not allowing two adjacent entities of the same class to be distinguished -- you have to assume that a run of words of the same category comprises one big entity. We use it by default, because it is simple and fast, despite that disadvantage. (Adjacent entities of the same category occur very rarely for person/organization/location, but they can be much more common in some other domains.)
But you can also annotate data and train a model using other encoding schemes such as IOB, which has the opposite properties (more complicated, the tagger runs slower, but it can represent adjacent entities of the same category). See this SO question for details.
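To make the difference between the two schemes concrete, here is a small sketch (not part of Stanford NER) that converts an IO-encoded label sequence into IOB, where the first token of each entity run gets a B- prefix. Note that plain IO input cannot recover two adjacent same-class entities, so each run is treated as one entity:

```python
def io_to_iob(tokens_labels):
    """Convert IO labels (e.g. ORGANIZATION) to IOB (B-/I- prefixed) labels."""
    result = []
    prev = "O"
    for token, label in tokens_labels:
        if label == "O":
            result.append((token, "O"))
        elif label == prev:
            # continuation of the current entity run
            result.append((token, "I-" + label))
        else:
            # first token of a new entity run
            result.append((token, "B-" + label))
        prev = label
    return result

print(io_to_iob([("New", "ORGANIZATION"), ("York", "ORGANIZATION"),
                 ("Times", "ORGANIZATION"), ("and", "O"),
                 ("Microsoft", "ORGANIZATION")]))
```

Under IOB, data annotated with explicit B- tags can then mark where one entity ends and an adjacent same-class entity begins, which plain IO cannot express.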