stanford-nlp tokenize

Patterns do not behave as expected


The actual patterns are not in English, so I created this simplified example to reproduce the problem. There are three levels of annotation (the real application requires them), and the third-level pattern does not work as expected. The phrase to be recognized is: a b c

What I expect:

  • 1st level: "a" is annotated as A, "b" is annotated as B
  • 2nd: if there are annotations A and B, annotate them all together as AB
  • 3rd: if at least one annotation AB is present and there is the word "c", annotate them all together as C

The patterns are shown below.

# 1.
{  pattern: (/a/), action: (Annotate($0, name, "A")) }
{  pattern: (/b/), action: (Annotate($0, name, "B")) }
# 2.
{  pattern: (([name:A]) ([name:B])), action: (Annotate($0, name, "AB")) }
# 3.
{  pattern: (([name:AB]+) /c/), action: (Annotate($0, name, "C")) }

#1 and #2 work, and "a b" is annotated:

matched token: NamedEntitiesToken{word='a' name='AB' beginPosition=0 endPosition=1}
matched token: NamedEntitiesToken{word='b' name='AB' beginPosition=2 endPosition=3}

But pattern #3 doesn't match, even though there are two tokens annotated "AB", which is exactly what pattern #3 expects. Moreover, if I change #1 to be

{  pattern: (/a/), action: (Annotate($0, name, "AB")) }
{  pattern: (/b/), action: (Annotate($0, name, "AB")) }

pattern #3 works correctly:

matched token: NamedEntitiesToken{word='a' name='C' beginPosition=0 endPosition=1}
matched token: NamedEntitiesToken{word='b' name='C' beginPosition=2 endPosition=3}
matched token: NamedEntitiesToken{word='c' name='C' beginPosition=4 endPosition=5}

I can't find any difference between matched tokens when I use

# In this case #3 pattern works
{  pattern: (/a/), action: (Annotate($0, name, "AB")) }
{  pattern: (/b/), action: (Annotate($0, name, "AB")) }

or when I use

# In this case #3 pattern doesn't work
# 1.
{  pattern: (/a/), action: (Annotate($0, name, "A")) }
{  pattern: (/b/), action: (Annotate($0, name, "B")) }
# 2.
{  pattern: (([name:A]) ([name:B])), action: (Annotate($0, name, "AB")) }

In both cases I get the same annotations, but the first scenario works and the second doesn't. What am I doing wrong?


Solution

  • This works for me:

    # these Java classes will be used by the rules
    ner = { type: "CLASS", value: "edu.stanford.nlp.ling.CoreAnnotations$NamedEntityTagAnnotation" }
    
    ENV.defaults["stage"] = 1
    
    { ruleType: "tokens", pattern: (/a/), action: Annotate($0, ner, "A") }
    { ruleType: "tokens", pattern: (/b/), action: Annotate($0, ner, "B") }
    
    ENV.defaults["stage"] = 2
    
    { ruleType: "tokens", pattern: ([{ner: "A"}] [{ner: "B"}]), action: Annotate($0, ner, "AB") }
    
    ENV.defaults["stage"] = 3
    
    { ruleType: "tokens", pattern: ([{ner: "AB"}]+ /c/), action: Annotate($0, ner, "ABC") }
    

    There is a write-up about TokensRegex here:

    https://stanfordnlp.github.io/CoreNLP/tokensregex.html
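
One way to picture the staged rules above is as three passes over the token list, where each pass sees the labels written by the previous one. The following is only a plain-Java model of that idea, not TokensRegex itself: it does not call CoreNLP, and `StagedAnnotationSketch`, `Token`, `stage1` to `stage3`, and `annotate` are hypothetical names made up for illustration. It uses the final label "ABC" from the answer's stage 3 rule (the question asked for "C").

```java
import java.util.ArrayList;
import java.util.List;

public class StagedAnnotationSketch {

    /** A token with its surface word and a mutable label (null means unlabeled). */
    static final class Token {
        final String word;
        String label;
        Token(String word) { this.word = word; }
        @Override public String toString() { return word + "/" + label; }
    }

    /** Stage 1: label single tokens by their word ("a" becomes A, "b" becomes B). */
    static void stage1(List<Token> tokens) {
        for (Token t : tokens) {
            if (t.word.equals("a")) t.label = "A";
            else if (t.word.equals("b")) t.label = "B";
        }
    }

    /** Stage 2: relabel every adjacent (A, B) pair as AB. */
    static void stage2(List<Token> tokens) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            Token x = tokens.get(i);
            Token y = tokens.get(i + 1);
            if ("A".equals(x.label) && "B".equals(y.label)) {
                x.label = "AB";
                y.label = "AB";
                i++; // skip the second token of the pair just consumed
            }
        }
    }

    /** Stage 3: relabel a run of one or more AB tokens followed by the word "c" as ABC. */
    static void stage3(List<Token> tokens) {
        int i = 0;
        while (i < tokens.size()) {
            int j = i;
            // Extend j over the longest run of AB-labeled tokens starting at i.
            while (j < tokens.size() && "AB".equals(tokens.get(j).label)) j++;
            if (j > i && j < tokens.size() && tokens.get(j).word.equals("c")) {
                for (int k = i; k <= j; k++) tokens.get(k).label = "ABC";
                i = j + 1;
            } else {
                i = Math.max(j, i + 1); // no match here, move past this position or run
            }
        }
    }

    /** Splits on whitespace and applies the three stages in order. */
    static List<Token> annotate(String text) {
        List<Token> tokens = new ArrayList<>();
        for (String w : text.trim().split("\\s+")) tokens.add(new Token(w));
        stage1(tokens);
        stage2(tokens);
        stage3(tokens);
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(annotate("a b c")); // prints [a/ABC, b/ABC, c/ABC]
    }
}
```

In this model the third pass can only match because the second pass has already finished writing its AB labels. That ordering is what the `ENV.defaults["stage"]` assignments in the rules file express.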