nlp stanford-nlp named-entity-recognition

How to train a Stanford NER model to skip a repeated word?

I am trying to extract named entities from text using the .NET Framework and the Stanford NER model. I have text like this:

Hello, I am John Doe. Body Mass index is 27. And Body Surface Area is 2.3m.

For this, I created a TSV file to train the model, as follows:

Hello   O
,   O
I   O
am  O
John    PERSON
Doe.    PERSON
Body    BMI
Mass    BMI
index   BMI
is  O
27. O
And O
Body    O
Surface O
Area    O
is  O
2.3m.   O

The prop file is as follows:

trainFileList = train/standford_train.tsv
serializeTo = dummy-ner-model-eng.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

I train the model with the following java command:

java -mx1g -cp stanford-ner.jar;lib/* edu.stanford.nlp.ie.crf.CRFClassifier -annotators 'tokenize,ssplit,pos,lemma,ner,regexner' -prop train/prop.txt

The problem I am facing is that Body gets tagged as BMI twice, because the word occurs in both Body Mass Index and Body Surface Area.

Is there any way I can omit the second Body tag?

Solution

You'll need to produce more training data that includes examples of Body not labeled as BMI; a single sentence gives the CRF too little evidence to tell the two contexts apart. If you are only looking for specific patterns, you might get better results with a rule-based approach. Stanford CoreNLP includes tools for building rule-based NER.
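As a rough illustration of the rule-based route, a TokensRegex rules file for this case might look like the sketch below. The label BMI comes from the question; the exact token pattern is an assumption about how the phrase appears in your text, and how the file is loaded into the pipeline should be checked against the TokensRegex documentation.

```
# Bind a short name to the NER annotation class used by the rule below
ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }

# Tag the three-token phrase "Body Mass Index" (or "index") as BMI.
# "Body Surface Area" does not match the pattern, so its "Body" is left alone.
{ ruleType: "tokens", pattern: ( /Body/ /Mass/ /[Ii]ndex/ ), action: Annotate($0, ner, "BMI"), result: "BMI" }
```

Because the rule matches the whole phrase rather than single words, a lone Body can never pick up the BMI label.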

More info: https://stanfordnlp.github.io/CoreNLP/tokensregex.html
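For the first suggestion (more training data), the extra rows can be generated rather than typed by hand. The following is only a sketch: the helper `to_tsv` and the example sentences are made up for illustration, and it assumes the same two-column layout as the TSV in the question (token, tab, label) with a blank line between sentences.

```python
# Hypothetical helper: builds "token<TAB>label" rows for a two-column training file.
# Each sentence is a list of (phrase, label) chunks; every whitespace-separated
# token in a phrase receives that chunk's label.

def to_tsv(sentences):
    lines = []
    for chunks in sentences:
        for phrase, label in chunks:
            for token in phrase.split():
                lines.append(f"{token}\t{label}")
        lines.append("")  # blank line separates sentences
    return "\n".join(lines)


# Made-up sentences: "Body" appears both inside and outside a BMI mention,
# so the model sees that the word alone does not imply the BMI label.
SENTENCES = [
    [("Her", "O"), ("Body Mass Index", "BMI"), ("is 31 .", "O")],
    [("Body Surface Area", "O"), ("was 1.9 m2 .", "O")],
    [("Body temperature is normal .", "O")],
    [("BMI", "BMI"), ("and Body weight were recorded .", "O")],
]

if __name__ == "__main__":
    print(to_tsv(SENTENCES))
```

The output can be appended to the file named in trainFileList. Note that punctuation is kept as separate tokens here, which is how a tokenizer would normally split the text at prediction time.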