I am trying to extract named entities from text using the .NET Framework and the Stanford NER model. I have text like this:
Hello, I am John Doe. Body Mass index is 27. And Body Surface Area is 2.3m.
For this I created a TSV file to train the model, as follows:
Hello O
, O
I O
am O
John PERSON
Doe. PERSON
Body BMI
Mass BMI
index BMI
is O
27. O
And O
Body O
Surface O
Area O
is O
2.3m. O
The prop file is as follows:
trainFileList = train/standford_train.tsv
serializeTo = dummy-ner-model-eng.ser.gz
map = word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true
noMidNGrams=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
useDisjunctive=true
and I am using the Java command below:
java -mx1g -cp stanford-ner.jar;lib/* edu.stanford.nlp.ie.crf.CRFClassifier -annotators 'tokenize,ssplit,pos,lemma,ner,regexner' -prop train/prop.txt
So the problem I am facing is that Body gets tagged as BMI twice, because Body appears in both Body Mass Index and Body Surface Area.
Is there any way I can omit the BMI tag on the second Body?
You'll need to produce more training data that has examples with Body not labeled as BMI. If you are only looking for specific patterns, you might get better results with a rule-based approach. There are tools for rule-based NER building in Stanford CoreNLP.
More info: https://stanfordnlp.github.io/CoreNLP/tokensregex.html
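
To illustrate the rule-based idea in general terms, here is a minimal sketch in plain Java. It does not use TokensRegex or any Stanford CoreNLP API; the class name `PhraseRuleTagger`, the method `tag`, and the single hard-coded phrase are all hypothetical and chosen only for this example. It labels a token as BMI only when it is part of the full phrase "Body Mass Index", so the Body in "Body Surface Area" stays O.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class PhraseRuleTagger {
    // Hypothetical rule: the three-token phrase "body mass index" maps to BMI.
    private static final List<String> BMI_PHRASE = Arrays.asList("body", "mass", "index");

    // Lowercase a token and drop trailing punctuation so "index." still matches.
    private static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT).replaceAll("\\p{Punct}+$", "");
    }

    // Returns one label per token: "BMI" inside a full phrase match, "O" elsewhere.
    public static List<String> tag(List<String> tokens) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            labels.add("O");
        }
        // Slide a window over the tokens and compare it to the whole phrase.
        for (int start = 0; start + BMI_PHRASE.size() <= tokens.size(); start++) {
            boolean match = true;
            for (int k = 0; k < BMI_PHRASE.size(); k++) {
                if (!normalize(tokens.get(start + k)).equals(BMI_PHRASE.get(k))) {
                    match = false;
                    break;
                }
            }
            if (match) {
                for (int k = 0; k < BMI_PHRASE.size(); k++) {
                    labels.set(start + k, "BMI");
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Simple whitespace tokenization, assumed here only for the demo.
        String text = "Body Mass index is 27. And Body Surface Area is 2.3m.";
        List<String> tokens = Arrays.asList(text.split("\\s+"));
        List<String> labels = tag(tokens);
        for (int i = 0; i < tokens.size(); i++) {
            System.out.println(tokens.get(i) + "\t" + labels.get(i));
        }
    }
}
```

Because the rule requires all three tokens in sequence, a lone Body never receives the BMI label. In a real pipeline the same kind of multi-token pattern would more likely be expressed with the CoreNLP tooling linked above rather than hand-written matching.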