I'm working on a named-entity annotation task for a text corpus. I found guidelines in the document 1999 Named Entity Recognition Task Definition. One of them concerns titles of persons: Titles such as “Mr.” and role names such as “President” are not considered part of a person name. For example, in “Mr. Harry Schearer” or “President Harry Schearer”, only Harry Schearer should be tagged as person.
In Stanford NER, though, there are many examples where the title is included in the person tag (Captain Weston, Mr. Perry, etc.). The example gazette they give shows this. By their convention, it seems that even “Mrs. and Miss Bates” should be tagged as a person.
Question: What is the most generally accepted guideline?
If you download Stanford CoreNLP 3.5.2 from here: http://nlp.stanford.edu/software/corenlp.shtml
and run this command:
java -Xmx6g -cp "*:." edu.stanford.nlp.pipeline.StanfordCoreNLP -ssplit.eolonly -annotators tokenize,ssplit,pos,lemma,ner -file ner_examples.txt -outputFormat text
(assuming you have put some sample sentences, one sentence per line, in ner_examples.txt)
the tagged tokens will be written to: ner_examples.txt.out
You can try out some sentences and see how our current NER system handles different situations. The system is trained on data in which titles are not tagged as PERSON, so in general it does not tag titles as PERSON.
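If you want to check this systematically rather than by eye, here is a minimal Python sketch that pulls the token/tag pairs out of the text output and lists the tags assigned to title tokens. It assumes each token is printed as a bracketed group like [Text=... NamedEntityTag=...], which may differ between CoreNLP versions. The TITLES set, the function names, and the sample string are hypothetical and hand-written for illustration; they are not real system output.

```python
import re

# Assumed token layout in the text output: "[Text=<token> ... NamedEntityTag=<tag> ...]".
# Adjust the pattern if your CoreNLP version prints tokens differently.
TOKEN_RE = re.compile(r"\[Text=(\S+)\s[^\]]*?NamedEntityTag=([^\s\]]+)")

# Hypothetical list of titles and role names to check; extend it for your corpus.
TITLES = {"Mr.", "Mrs.", "Ms.", "Miss", "Dr.", "President", "Captain"}


def read_tags(output_text):
    """Return (token, NER tag) pairs found in the text output."""
    return TOKEN_RE.findall(output_text)


def title_tags(pairs):
    """Keep only the pairs whose token is one of TITLES."""
    return [(token, tag) for token, tag in pairs if token in TITLES]


# Hand-written sample in the assumed layout (not real system output).
sample = (
    "[Text=President CharacterOffsetBegin=0 CharacterOffsetEnd=9 "
    "PartOfSpeech=NNP Lemma=President NamedEntityTag=O] "
    "[Text=Harry CharacterOffsetBegin=10 CharacterOffsetEnd=15 "
    "PartOfSpeech=NNP Lemma=Harry NamedEntityTag=PERSON] "
    "[Text=Schearer CharacterOffsetBegin=16 CharacterOffsetEnd=24 "
    "PartOfSpeech=NNP Lemma=Schearer NamedEntityTag=PERSON]"
)

pairs = read_tags(sample)
print(title_tags(pairs))
```

To run it on your own results, replace sample with the contents of ner_examples.txt.out, for example open("ner_examples.txt.out").read(). Any title that shows up with the tag PERSON is a case where the system departs from the guideline you quoted.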