We have a report writing tool that we're trying to add a search capability to. Essentially a user would be able to type in a question and get a report back based on the criteria in the sentence. We're trying to keep this as open-ended as we can, not requiring a specific sentence structure, which is why we thought we'd try OpenNLP NER.
An example would be:
"what was Arts attendance last quarter"
Tagged as:
what was <START:dept> Arts <END> <START:filter> attendance <END> last <START:calc> quarter <END>
We've tried to come up with many different variations of the questions, with varying departments, filters, etc. We're still not at the recommended 15k training sentences, only at 14.6k, so we're still working towards that.
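For instance, a couple of (made-up) variations in the same annotated format, just to show the kind of coverage we're going for:

what was <START:dept> Science <END> <START:filter> enrollment <END> last <START:calc> year <END>
how many <START:filter> absences <END> did <START:dept> Music <END> have this <START:calc> month <END>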
As far as analyzing the question goes, this is the start of it:
// Snippet from a larger method: tokenizedQuestion, question, modelNames, deptList,
// filterList, and calculationList are declared elsewhere.
InputStream tokenStream = getClass().getResourceAsStream("/en-token.bin"); //$NON-NLS
TokenizerModel tokenModel = new TokenizerModel(tokenStream);
Tokenizer tokenizer = new TokenizerME(tokenModel);

for (String name : modelNames) {
    tokenizedQuestion = tokenizer.tokenize(question);
    String alteredQuestion = question;
    // Load the custom NER model and find entity spans in the tokenized question
    TokenNameFinderModel entityModel = new TokenNameFinderModel(getClass().getResourceAsStream(name));
    NameFinderME nameFinder = new NameFinderME(entityModel);
    Span[] nameSpans = nameFinder.find(tokenizedQuestion);
    // Bucket each span by its entity type
    for (Span span : nameSpans) {
        if (span.getType().equals("dept")) {
            deptList.add(span);
        } else if (span.getType().equals("filter")) {
            filterList.add(span);
        } else if (span.getType().equals("calculation")) {
            calculationList.add(span);
        }
    }
}
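For reference, a quick way to see the actual text each span covers is OpenNLP's Span.spansToStrings utility (just a sketch, not part of our code):

// Map each detected span back to the token text it covers
String[] entityText = Span.spansToStrings(nameSpans, tokenizedQuestion);
for (int i = 0; i < nameSpans.length; i++) {
    System.out.println(nameSpans[i].getType() + " -> " + entityText[i]);
}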
The problem now is that if you put in "what was Bugs Bunny last cartoon" you get 'Bugs' as a dept, 'Bunny' as a filter, and 'cartoon' as a calculation.
I'm guessing our training questions are too similar to each other and now it's assuming whatever follows "what was" is a department.
1. Is that a correct assumption and is there a better way of training these models?
2. Is the best bet to break each entity into its own model? I did try this and had 105 unit tests fail afterwards, so I'm hoping to try something simpler first, lol.
Also, I have read multiple threads on here about custom NER models, but most of what I've found covers how to start one. There's also a thread claiming that multiple entity types in one model don't work. I forget where the post was, but I found that passing null for the type allows you to tag multiple types in the same model, and it seems to work fairly well:
tokenNameFinderModel = NameFinderME.train("en", null, sampleStream, TrainingParameters.defaultParams(), new TokenNameFinderFactory());
tokenNameFinderModel.serialize(modelOut);
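where sampleStream reads our annotated questions (one per line, in the <START:type> ... <END> format above) and modelOut is just the output file. Roughly, with placeholder file names:

// opennlp.tools.util.MarkableFileInputStreamFactory / PlainTextByLineStream,
// opennlp.tools.namefind.NameSampleDataStream
InputStreamFactory inputFactory = new MarkableFileInputStreamFactory(new File("training-questions.txt"));
ObjectStream<String> lineStream = new PlainTextByLineStream(inputFactory, StandardCharsets.UTF_8);
ObjectStream<NameSample> sampleStream = new NameSampleDataStream(lineStream);
OutputStream modelOut = new BufferedOutputStream(new FileOutputStream("en-ner-custom.bin"));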
Thanks in advance for any and all help!!
Our end goal was to be able to train a model on certain words that we classified and have it correctly classify each word regardless of sentence structure. In OpenNLP we weren't able to accomplish that.
I'm guessing our training questions are too similar to each other and now it's assuming whatever follows "what was" is a department.
1. Is that a correct assumption and is there a better way of training these models?
Based on my testing and results, I'm concluding that yes, the sequence and pattern of the words plays a part. I don't have any documentation to back that up, though. Also, I couldn't find any way to get around that with OpenNLP.
2. Is the best bet to break each entity into its own model?
Based on experience and testing, I've concluded that separate models, as much as possible, are the best way to train. Unfortunately, we still weren't able to accomplish our goals even with this approach.
Ultimately, what we've done is switch to Stanford NLP NER models. You can still do custom implementations around domain-specific language, and you have the option of turning off sequencing in the properties file:
usePrev=false
useNext=false
useDisjunctive=false
useSequences=false
usePrevSequences=false
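For context, those flags sit in the same properties file as the rest of the training setup, which looks roughly like this on top of the flags above (the file names and column map are placeholders, not our actual setup):

trainFile=training-questions.tsv
serializeTo=ner-model.ser.gz
map=word=0,answer=1
useClassFeature=true
useWord=true
useNGrams=true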
Reference for custom NER in StanfordNLP: Stanford CoreNLP: Training your own custom NER tagger
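For anyone following the same path, loading and using the trained Stanford model from Java looks roughly like this (just a sketch based on the Stanford NER demo; the model path is a placeholder and exception handling is omitted):

// Load the serialized CRF model and pull out labeled entities with character offsets
AbstractSequenceClassifier<CoreLabel> classifier = CRFClassifier.getClassifier("ner-model.ser.gz");
List<Triple<String, Integer, Integer>> entities =
        classifier.classifyToCharacterOffsets("what was Arts attendance last quarter");
for (Triple<String, Integer, Integer> entity : entities) {
    // entity.first() is the label from training data (e.g. dept), second/third are offsets into the question
    System.out.println(entity.first() + " -> " + entity.second() + "," + entity.third());
}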