I am identifying qualifications in a large corpus. I am using NamedEntityTagAnnotation.
Problem:
My annotations are read in as case-sensitive. I want them to be case-insensitive, so that the entry
Bachelor's Degree DEGREE
does not need an additional entry of
Bachelor's degree DEGREE
I know this is possible: RegexNERAnnotator has an ignoreCase field. But I don't know how to access RegexNERAnnotator through the API.
My current code (which I cadged off the internet and works apart from the case issue) is as follows:
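    String prevNeToken = "O";
    String currNeToken = "O";
    boolean newToken = true;
    for (CoreLabel token : sentence.get(TokensAnnotation.class))
    {
        currNeToken = token.get(NamedEntityTagAnnotation.class);
        String word = token.get(TextAnnotation.class);
        // Not part of any entity: flush the entity collected so far, if any.
        if (currNeToken.equals("O"))
        {
            if (!prevNeToken.equals("O") && (sbuilder.length() > 0))
            {
                handleEntity(prevNeToken, sbuilder, tokens);
                newToken = true;
            }
            continue;
        }
        // First token of a new entity.
        if (newToken)
        {
            prevNeToken = currNeToken;
            newToken = false;
            sbuilder.append(word);
            continue;
        }
        // Same tag as the previous token: extend the current entity.
        if (currNeToken.equals(prevNeToken))
        {
            sbuilder.append(" " + word);
        }
        else
        {
            // Tag changed with no "O" token in between: flush the old entity,
            // then start the new one with this word so it is not dropped.
            // Assumes handleEntity resets sbuilder, as the "O" branch already relies on.
            handleEntity(prevNeToken, sbuilder, tokens);
            sbuilder.append(word);
        }
        prevNeToken = currNeToken;
    }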
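The grouping logic of that loop can be sketched without CoreNLP, which makes it easier to test in isolation. This is only a minimal sketch: the class EntityGrouper, its group method, and the parallel word/tag arrays are hypothetical stand-ins for the CoreLabel tokens and for handleEntity, and it also flushes an entity that runs to the end of the sentence.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch: merges consecutive tokens that share a
// non-"O" tag into one entity string of the form "TAG: words".
final class EntityGrouper {

    static List<String> group(String[] words, String[] tags) {
        List<String> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String prevTag = "O";
        for (int i = 0; i < words.length; i++) {
            String tag = tags[i];
            // Tag changed: close the entity collected so far, if any.
            if (!tag.equals(prevTag) && current.length() > 0) {
                entities.add(prevTag + ": " + current);
                current.setLength(0);
            }
            // Tokens tagged "O" are outside any entity and are skipped.
            if (!tag.equals("O")) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(words[i]);
            }
            prevTag = tag;
        }
        // Flush an entity that runs to the end of the sentence.
        if (current.length() > 0) {
            entities.add(prevTag + ": " + current);
        }
        return entities;
    }

    public static void main(String[] args) {
        String[] words = {"a", "Bachelor's", "Degree", "from", "MIT"};
        String[] tags = {"O", "DEGREE", "DEGREE", "O", "ORGANIZATION"};
        // prints [DEGREE: Bachelor's Degree, ORGANIZATION: MIT]
        System.out.println(group(words, tags));
    }
}
```

In the real loop, words[i] and tags[i] would come from TextAnnotation and NamedEntityTagAnnotation on each CoreLabel.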
Any assistance would be greatly appreciated.
The answer is in how you set up the pipeline.
    Properties props = new Properties();
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, regexner, depparse, natlog, openie");
    //props.put("regexner.mapping", namedEntityPropertiesPath);
    pipeline = new StanfordCoreNLP(props);
    pipeline.addAnnotator(new TokensRegexNERAnnotator(namedEntityPropertiesPath, true));
Do not use props.put("regexner.mapping", namedEntityPropertiesPath); use pipeline.addAnnotator instead.
The first argument to the constructor is the path to your NER data file. The second is a boolean, ignoreCase; pass true to match case-insensitively.
Note that this then uses Stanford's NER lists as well as your own. It also supports a more complex NER data file format.
See http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexNERAnnotator.html
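To illustrate what an ignore-case flag changes for entries like those in the question, here is a plain java.util.regex sketch. It is not CoreNLP's implementation: the class CaseInsensitiveMapping and its add and lookup methods are hypothetical, and it matches whole phrases rather than token sequences.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical in-memory stand-in for a mapping file of "phrase -> entity type".
final class CaseInsensitiveMapping {

    private final List<Pattern> patterns = new ArrayList<>();
    private final List<String> types = new ArrayList<>();
    private final boolean ignoreCase;

    CaseInsensitiveMapping(boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
    }

    // Registers a literal phrase; the case flags are applied once, at compile time.
    void add(String phrase, String type) {
        int flags = ignoreCase ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE) : 0;
        patterns.add(Pattern.compile(Pattern.quote(phrase), flags));
        types.add(type);
    }

    // Returns the type of the first phrase matching the whole text, or "O" if none does.
    String lookup(String text) {
        for (int i = 0; i < patterns.size(); i++) {
            if (patterns.get(i).matcher(text).matches()) {
                return types.get(i);
            }
        }
        return "O";
    }

    public static void main(String[] args) {
        CaseInsensitiveMapping strict = new CaseInsensitiveMapping(false);
        strict.add("Bachelor's Degree", "DEGREE");
        System.out.println(strict.lookup("Bachelor's degree")); // prints O

        CaseInsensitiveMapping relaxed = new CaseInsensitiveMapping(true);
        relaxed.add("Bachelor's Degree", "DEGREE");
        System.out.println(relaxed.lookup("Bachelor's degree")); // prints DEGREE
    }
}
```

With the flag on, a single entry covers every capitalisation, which is why the second mapping line in the question becomes unnecessary.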