Which settings should be used for TokensRegexNER?


When I try regexner, it works as expected with the following settings and data:

props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, regexner");

Bachelor of Laws DEGREE
Bachelor of (Arts|Laws|Science|Engineering|Divinity) DEGREE

What I would like to do is the same thing using TokensRegex. For example:

Bachelor of Laws DEGREE
Bachelor of ([{tag:NNS}] [{tag:NNP}]) DEGREE

I read that to do this, I should use TokensRegexNERAnnotator.

I tried to use it as follows, but it did not work.

Pipeline.addAnnotator(new TokensRegexNERAnnotator("expressions.txt", true));

I also tried setting the annotator another way:

props.setProperty("annotators", "tokenize, cleanxml, ssplit, pos, lemma, tokenregexner");    
props.setProperty("customAnnotatorClass.tokenregexner", "edu.stanford.nlp.pipeline.TokensRegexNERAnnotator");

I tried different TokensRegex formats, but either the annotator could not find the expression or I got a SyntaxException.

What is the proper way to use TokensRegex (queries on tokens with tags) in an NER data file?

BTW, I just saw a comment in the TokensRegexNERAnnotator.java file. I am not sure whether it is related, but it may mean that POS tags do not work with this annotator.

if (entry.tokensRegex != null) {
    // TODO: posTagPatterns...
    pattern = TokenSequencePattern.compile(env, entry.tokensRegex);
}

Solution

  • First you need to make a TokensRegex rule file (sample_degree.rules). Here is an example:

    ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
    
    { pattern: (/Bachelor/ /of/ [{tag:NNP}]), action: Annotate($0, ner, "DEGREE") }
    

    To explain the rule a bit: the pattern field specifies the token sequence to match. The action field says to annotate every token in the overall match (that is what $0 represents) by setting the ner field (which we defined with ner = ... at the top of the rule file) to the String "DEGREE", which is the third parameter.
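
    To make the match-and-annotate behaviour concrete, here is a plain-Java toy that mimics what the rule does on a hand-tagged sentence. It does not call CoreNLP at all: the Token class, the sampleSentence and annotateDegrees methods, and the POS tags are all made up for illustration, and the real tagger may assign different tags.

    ```java
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    class DegreeRuleSketch {

        /** Hypothetical stand-in for a CoreNLP token: word, POS tag, NER label. */
        static final class Token {
            final String word;
            final String tag;
            String ner = "O"; // "O" = no entity, as in CoreNLP output

            Token(String word, String tag) {
                this.word = word;
                this.tag = tag;
            }
        }

        /** Hand-tagged example sentence (the tags are assumed, not tagger output). */
        static List<Token> sampleSentence() {
            return new ArrayList<>(Arrays.asList(
                    new Token("She", "PRP"), new Token("holds", "VBZ"),
                    new Token("a", "DT"), new Token("Bachelor", "NNP"),
                    new Token("of", "IN"), new Token("Laws", "NNP"),
                    new Token(".", ".")));
        }

        /**
         * Mimics: { pattern: (/Bachelor/ /of/ [{tag:NNP}]),
         *           action: Annotate($0, ner, "DEGREE") }
         * Every token of each three-token match (the $0 group) gets ner = "DEGREE".
         */
        static void annotateDegrees(List<Token> tokens) {
            for (int i = 0; i + 2 < tokens.size(); i++) {
                boolean matches = tokens.get(i).word.equals("Bachelor")
                        && tokens.get(i + 1).word.equals("of")
                        && tokens.get(i + 2).tag.equals("NNP");
                if (matches) {
                    for (int j = i; j <= i + 2; j++) {
                        tokens.get(j).ner = "DEGREE";
                    }
                    i += 2; // resume scanning after the matched span
                }
            }
        }

        public static void main(String[] args) {
            List<Token> sentence = sampleSentence();
            annotateDegrees(sentence);
            for (Token t : sentence) {
                System.out.println(t.word + "\t" + t.tag + "\t" + t.ner);
            }
        }
    }
    ```

    The point of the sketch is only that the label is applied to the whole matched span, not just to the token that carries the tag constraint.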

    Then make this .props file (degree_example.props) for the command:

    customAnnotatorClass.tokensregex = edu.stanford.nlp.pipeline.TokensRegexAnnotator
    
    tokensregex.rules = sample_degree.rules
    
    annotators = tokenize,ssplit,pos,lemma,ner,tokensregex
    

    Then run this command:

    java -Xmx8g edu.stanford.nlp.pipeline.StanfordCoreNLP -props degree_example.props -file sample-degree-sentence.txt -outputFormat text
    

    You should see that the three tokens you wanted tagged as "DEGREE" will be tagged.

    I think I will push a change to the code to make tokensregex link to the TokensRegexAnnotator so you won't have to specify it as a custom annotator. But for now you need to add that line in the .props file.

    This example should help in implementing this. Here are some more resources if you want to learn more:

    http://nlp.stanford.edu/software/tokensregex.shtml#TokensRegexRules

    http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/SequenceMatchRules.html

    http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/ling/tokensregex/types/Expressions.html
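
Since the question sets properties from Java code rather than from a file, the same three settings from degree_example.props can presumably be built in code as well. The following is a minimal sketch using only java.util.Properties; the class name DegreePipelineProps and the build method are made up for illustration, and the step that actually constructs the pipeline is left as a comment because it needs CoreNLP on the classpath.

```java
import java.util.Properties;

class DegreePipelineProps {

    /** Builds the same settings as degree_example.props, in code. */
    static Properties build(String rulesPath) {
        Properties props = new Properties();
        // Register TokensRegexAnnotator under the name "tokensregex".
        props.setProperty("customAnnotatorClass.tokensregex",
                "edu.stanford.nlp.pipeline.TokensRegexAnnotator");
        // Point the annotator at the rule file.
        props.setProperty("tokensregex.rules", rulesPath);
        // Run tokensregex last, after the annotators it depends on.
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,tokensregex");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build("sample_degree.rules");
        props.forEach((key, value) -> System.out.println(key + " = " + value));
        // With CoreNLP on the classpath, these properties would then be passed
        // to the pipeline constructor: new StanfordCoreNLP(props)
    }
}
```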