Stanford NLP: set RegexNERAnnotator to caseInsensitive


I am identifying qualifications in a large corpus. I am using NamedEntityTagAnnotation.

Problem:

My annotations are read in as case-sensitive. I want them to be case-insensitive, so that an entry such as

Bachelor's Degree DEGREE

does not need an additional entry of

Bachelor's degree DEGREE

I know this is possible. RegexNERAnnotator has a field for ignoreCase. But I don't know how to access RegexNERAnnotator through the API.

My current code (which I cadged off the internet and works apart from the case issue) is as follows:

    String prevNeToken = "O";
    String currNeToken = "O";
    boolean newToken = true;
    for (CoreLabel token : sentence.get(TokensAnnotation.class))
    {
      currNeToken = token.get(NamedEntityTagAnnotation.class);

      String word = token.get(TextAnnotation.class);

      if (currNeToken.equals("O"))
      {

        if (!prevNeToken.equals("O") && (sbuilder.length() > 0))
        {
          handleEntity(prevNeToken, sbuilder, tokens);
          newToken = true;
        }
        continue;
      }

      if (newToken)
      {
        prevNeToken = currNeToken;
        newToken = false;
        sbuilder.append(word);
        continue;
      }

      if (currNeToken.equals(prevNeToken))
      {
        sbuilder.append(" " + word);
      }
      else
      {

        handleEntity(prevNeToken, sbuilder, tokens);
        newToken = true;
      }
      prevNeToken = currNeToken;
    }
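
For readers who want to try the grouping logic without CoreNLP on the classpath, here is a minimal sketch of the same idea over plain strings. The names EntityGrouper, Entity and group are hypothetical and not part of CoreNLP. It also makes two assumptions that differ from the loop above: an entity that runs to the end of the sentence is flushed after the loop, and when the tag switches directly from one entity type to another, the current word starts the new entity.

```java
import java.util.ArrayList;
import java.util.List;

class EntityGrouper {
    /** A recognized entity: its NER tag and the joined token text. */
    static final class Entity {
        final String tag;
        final String text;
        Entity(String tag, String text) { this.tag = tag; this.text = text; }
        @Override public String toString() { return tag + ":" + text; }
    }

    /**
     * Groups consecutive tokens that share a non-"O" tag into entities.
     * words.get(i) and tags.get(i) describe the i-th token.
     */
    static List<Entity> group(List<String> words, List<String> tags) {
        List<Entity> entities = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        String prevTag = "O";
        for (int i = 0; i < words.size(); i++) {
            String tag = tags.get(i);
            // Close the running entity whenever the tag changes.
            if (!tag.equals(prevTag) && current.length() > 0) {
                entities.add(new Entity(prevTag, current.toString()));
                current.setLength(0);
            }
            // Tokens tagged "O" are outside any entity and are skipped.
            if (!tag.equals("O")) {
                if (current.length() > 0) current.append(' ');
                current.append(words.get(i));
            }
            prevTag = tag;
        }
        // Flush an entity that runs to the end of the sentence.
        if (current.length() > 0) {
            entities.add(new Entity(prevTag, current.toString()));
        }
        return entities;
    }
}
```

In the real loop, words and tags would come from TextAnnotation and NamedEntityTagAnnotation on each CoreLabel.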

Any assistance would be greatly appreciated.


Solution

  • The answer is in how you set up the pipeline.

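        // Imports needed by this snippet:
        // import java.util.Properties;
        // import edu.stanford.nlp.pipeline.StanfordCoreNLP;
        // import edu.stanford.nlp.pipeline.TokensRegexNERAnnotator;

        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, regexner, depparse, natlog, openie");

        // Deliberately not set; the mapping file is passed to the annotator below instead.
        //props.put("regexner.mapping", namedEntityPropertiesPath);

        pipeline = new StanfordCoreNLP(props);
        // Second constructor argument: true makes the mapping case-insensitive.
        pipeline.addAnnotator(new TokensRegexNERAnnotator(namedEntityPropertiesPath, true));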

    Do not use props.put("regexner.mapping", namedEntityPropertiesPath); use pipeline.addAnnotator instead.

    The first argument to the constructor is the path to your NER data file. The second is a boolean that makes matching case-insensitive when true (the Javadoc names it ignoreCase).

    Note that this uses Stanford's NER lists as well as your own. It also uses a more complex NER data file format.

    See http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/pipeline/TokensRegexNERAnnotator.html
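
As a possible alternative to adding the annotator programmatically, the CoreNLP RegexNER documentation also lists a regexner.ignorecase property. The sketch below only builds the Properties object, so it runs without CoreNLP. Whether your CoreNLP version honours the property is an assumption you should verify. The class name RegexNerConfig, the method name caseInsensitiveProps and the file name degrees.tab are hypothetical.

```java
import java.util.Properties;

class RegexNerConfig {
    /** Builds pipeline properties that request case-insensitive RegexNER matching. */
    static Properties caseInsensitiveProps(String mappingPath) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, regexner");
        // Path to the tab-separated mapping file (pattern, then NER type).
        props.setProperty("regexner.mapping", mappingPath);
        // Assumption: the installed CoreNLP version reads this property.
        props.setProperty("regexner.ignorecase", "true");
        return props;
    }
}
// With CoreNLP on the classpath (not exercised here):
// StanfordCoreNLP pipeline =
//     new StanfordCoreNLP(RegexNerConfig.caseInsensitiveProps("degrees.tab"));
```

If this works in your version, a single mapping entry should cover both capitalisations without a second line in the data file.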