nlp, stanford-nlp, named-entity-recognition

How to train Stanford NLP NER Extraction model to skip the repeating words?


I am trying to extract named entities from text using the .NET Framework and the Stanford NER model. I have text like this:

Hello, I am John Doe. Body Mass index is 27. And Body Surface Area is 2.3m.

For this I created a TSV file to train the model, as follows:

Hello   O
,   O
I   O
am  O
John    PERSON
Doe.    PERSON
Body    BMI
Mass    BMI
index   BMI
is  O
27. O
And O
Body    O
Surface O
Area    O
is  O
2.3m.   O

The prop file is as follows:

trainFileList = train/standford_train.tsv
serializeTo = dummy-ner-model-eng.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true

and I am using the Java command below:

java -mx1g -cp stanford-ner.jar;lib/* edu.stanford.nlp.ie.crf.CRFClassifier -annotators 'tokenize,ssplit,pos,lemma,ner,regexner' -prop train/prop.txt

The problem I am facing is that Body is tagged as BMI twice, because the word is repeated in Body Mass Index and Body Surface Area.

Is there any way I can omit this second Body tag?


Solution

  • You'll need to produce more training data that includes examples where Body is not labeled as BMI. If you are only looking for specific patterns, you might get better results with a rule-based approach. Stanford CoreNLP includes tools for building rule-based NER.

    More info: https://stanfordnlp.github.io/CoreNLP/tokensregex.html
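
To illustrate the idea behind both suggestions, here is a minimal sketch in plain Java. It does not use Stanford NER or CoreNLP; the class PhraseTagger and its methods tag and toTsv are hypothetical names made up for this example. It labels a token as BMI only when the whole phrase "Body Mass Index" matches, so the Body in "Body Surface Area" stays O. The output is tab-separated in the same word/answer layout as the training file in the question, so something like it could also be used to generate additional training rows. It assumes the text is already split into tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Stanford NER or CoreNLP.
final class PhraseTagger {

    // Labels every token "O", then overwrites the tokens covered by a full,
    // case-insensitive match of the phrase with the given label.
    static List<String> tag(List<String> tokens, List<String> phrase, String label) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            labels.add("O");
        }
        for (int start = 0; start + phrase.size() <= tokens.size(); start++) {
            boolean match = true;
            for (int k = 0; k < phrase.size(); k++) {
                if (!tokens.get(start + k).equalsIgnoreCase(phrase.get(k))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                for (int k = 0; k < phrase.size(); k++) {
                    labels.set(start + k, label);
                }
            }
        }
        return labels;
    }

    // Renders token/label pairs as tab-separated rows (word=0, answer=1).
    static String toTsv(List<String> tokens, List<String> labels) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            sb.append(tokens.get(i)).append('\t').append(labels.get(i)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "Body", "Mass", "index", "is", "27.",
                "And", "Body", "Surface", "Area", "is", "2.3m.");
        List<String> phrase = Arrays.asList("Body", "Mass", "Index");
        // Only the first three tokens are labeled BMI; the second "Body" stays O.
        System.out.print(toTsv(tokens, tag(tokens, phrase, "BMI")));
    }
}
```

A fixed phrase match like this only covers exact wordings. For anything more flexible, the TokensRegex tooling linked above is likely the better fit.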