OpenNLP TokenNameFinder for entities other than names


I'm new to OpenNLP. I started training a model (TokenNameFinderTrainer) to identify names. So far so good, but now I want to identify organizations (such as "Microsoft").

My question is: which types of entities does OpenNLP recognize by default (if any)?

I see that it can handle <START:person> Daryl Williams <END> .

But is it okay to create something like <START:organization> Metro-Goldwyn-Mayer Studios Inc. <END>, or <START:company> Metro-Goldwyn-Mayer Studios Inc. <END>?

Meaning: can I label categories as I please, or

do I have to use a default category for that? If so, which are the default ones?

EDIT:

I have found the answers via further reading. I am now asking for confirmation...

I can label entities as I please, and it is wiser to make one model per entity type. Am I right?

For instance: one for locations, one for names, one for companies?

Any ideas on how to proceed when the same company (for instance) is written both as Microsoft and as microsoft?

Thanks for the help!


Solution

  • You can train a model for any entity type you want; I recommend one model per type. OpenNLP uses machine learning to find entities, so it will find whatever your training data teaches the model to find. If you annotate both microsoft and Microsoft, or even a misspelling of Microsoft, it will try to find them. If you have a small list of names with only a few variants each, and you need the results to be normalized, consider using a RegexNameFinder. If you pull the trunk, you can construct the RegexNameFinder with a Map that maps a label to an array of regex patterns.

    EDIT: Here is a link to the OpenNLP unit test cases for the RegexNameFinder. This is from the 1.6 snapshot.

    http://svn.apache.org/viewvc/opennlp/trunk/opennlp-tools/src/test/java/opennlp/tools/namefind/RegexNameFinderTest.java?view=co
    

    If the link doesn't work, here is a basic example.

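      import java.util.HashMap;
      import java.util.Map;
      import java.util.regex.Pattern;

      import opennlp.tools.namefind.RegexNameFinder;
      import opennlp.tools.util.Span;

      public class RegexNameFinderExample {

        public static void main(String[] args) {
          // Tokenized input sentence
          String[] sentence = {"a", "test", "b", "c"};

          // Map each entity type (label) to the regex patterns that identify it
          Pattern testPattern = Pattern.compile("test");
          Map<String, Pattern[]> regexMap = new HashMap<>();
          regexMap.put("testtype", new Pattern[]{testPattern});

          RegexNameFinder finder = new RegexNameFinder(regexMap);

          // Each Span holds the token range of one match
          Span[] result = finder.find(sentence);
          for (Span span : result) {
            System.out.println(span);
          }
        }
      }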
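
For the Microsoft/microsoft case specifically, the normalization idea can be sketched with plain java.util.regex, without OpenNLP on the classpath. This is only an illustration of mapping spelling variants to one canonical name, not how RegexNameFinder is implemented. The class name VariantNormalizer, the variant table, and the "index:name" output format are all made up for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class VariantNormalizer {

    // Assumed table: canonical name -> pattern covering its known variants.
    static final Map<String, Pattern> VARIANTS = new LinkedHashMap<>();
    static {
        VARIANTS.put("Microsoft",
                Pattern.compile("microsoft|micro-soft", Pattern.CASE_INSENSITIVE));
        VARIANTS.put("Metro-Goldwyn-Mayer",
                Pattern.compile("mgm|metro-goldwyn-mayer", Pattern.CASE_INSENSITIVE));
    }

    // Returns the canonical name if a pattern matches the whole token, else null.
    static String normalize(String token) {
        for (Map.Entry<String, Pattern> entry : VARIANTS.entrySet()) {
            if (entry.getValue().matcher(token).matches()) {
                return entry.getKey();
            }
        }
        return null;
    }

    // Collects "tokenIndex:canonicalName" for every token that is a known variant.
    static List<String> findCompanies(String[] tokens) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String canonical = normalize(tokens[i]);
            if (canonical != null) {
                hits.add(i + ":" + canonical);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] sentence = {"I", "work", "at", "microsoft", "and", "MGM"};
        System.out.println(findCompanies(sentence)); // prints [3:Microsoft, 5:Metro-Goldwyn-Mayer]
    }
}
```

A lookup like this only works for a short, known list of names. For open-ended entity types, a trained model with both spellings in the training data is the better fit, as described above.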