python nlp nltk stanford-nlp named-entity-recognition

NLTK: combining the Stanford tagger and a custom tagger

The goal of my project is to answer queries such as "I am looking for American women between 20 and 30 years old who work at Google". I then have to process the query and look in a database to find the answer.

For this, I need to combine the Stanford 3-class NER tagger with my own tagger. My NER tagger can tag ages, nationalities and genders, but I need the Stanford tagger to tag organizations, as I don't have any training data for those.

Right now, I have code like this:

from nltk.tag.stanford import NERTagger

def __init__(self, q):
    self.userQuery = q

def get_tagged_tokens(self):
    # Raw strings keep the Windows backslashes literal.
    st = NERTagger(r'C:\stanford-ner-2015-01-30\my-ner-model.ser.gz',
                   r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    result = st.tag(self.userQuery.split())[0]
    return result

And I would like to have something like this:

def get_tagged_tokens(self):
    st = NERTagger(r'C:\stanford-ner-2015-01-30\my-ner-model.ser.gz',
                   r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    st_def = NERTagger(r'C:\stanford-ner-2015-01-30\classifiers\english.all.3class.distsim.crf.ser.gz',
                       r'C:\stanford-ner-2015-01-30\stanford-ner.jar')
    # BackoffTagger is hypothetical: it stands for the combined tagger I am looking for.
    tagger = BackoffTagger([st, st_def])
    result = tagger.tag(self.userQuery.split())[0]
    return result

That is, the combined tagger would first use my model and then fall back to the Stanford one for words that are still untagged.

Is it possible to combine my model with the Stanford model just to tag organizations? If yes, what is the best way to perform this?

Thank you!

Solution

NERClassifierCombiner in Stanford CoreNLP 3.5.2 and Stanford NER 3.5.2 has new command-line functionality that makes it easy to get this effect from NLTK.

When you provide a list of serialized classifiers, NERClassifierCombiner runs them in sequence, in the order you list them. Once one tagger has tagged a token, later taggers will not retag it. My demo code passes two classifiers as an example; I believe you can list as many as 10, if I recall correctly.
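To make the precedence rule concrete, here is a minimal pure-Python sketch of the behaviour described above. It is only an illustration of "earlier taggers win", not how NERClassifierCombiner is actually implemented; the function name combine_tags, the label names NATIONALITY and GENDER, and the sample outputs are all assumptions made up for the example.

```python
def combine_tags(tagged_sequences, outside="O"):
    """Merge per-token tags from several taggers; the earliest tagger wins.

    tagged_sequences is a list of lists of (token, tag) pairs, one list per
    tagger, all over the same tokens and given in priority order.
    """
    merged = list(tagged_sequences[0])
    for later in tagged_sequences[1:]:
        for i, (token, tag) in enumerate(later):
            # Only fill positions that every earlier tagger left untagged.
            if merged[i][1] == outside and tag != outside:
                merged[i] = (token, tag)
    return merged


# Hypothetical outputs: a custom model that knows NATIONALITY and GENDER,
# and a stock model that knows ORGANIZATION.
custom = [("American", "NATIONALITY"), ("women", "GENDER"),
          ("at", "O"), ("Google", "O")]
stock = [("American", "O"), ("women", "O"),
         ("at", "O"), ("Google", "ORGANIZATION")]

print(combine_tags([custom, stock]))
```

With the custom model listed first, its labels are kept and the stock model only contributes the ORGANIZATION tag on the token the custom model left as O.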

First, make sure that you have a copy of Stanford CoreNLP 3.5.2 or Stanford NER 3.5.2, so that you have a .jar file with this new functionality.

Second, make sure your custom NER model was built with Stanford CoreNLP or Stanford NER; this won't work otherwise. A model built with an older version should be fine.

Third, subclass NERTagger. If people would like, I could look into pushing this to NLTK so it is there by default!

Here is some sample code that should work. It is a little hacky since I was rushing this out the door: for instance, the first argument to NERComboTagger's constructor (classifier_path1) serves no real purpose, but the code would crash if I didn't put a valid file there.

#!/usr/bin/python

from nltk.tag.stanford import NERTagger

class NERComboTagger(NERTagger):

  def __init__(self, *args, **kwargs):
    # Comma-separated list of serialized classifiers, in priority order.
    self.stanford_ner_models = kwargs.pop('stanford_ner_models')
    super(NERComboTagger, self).__init__(*args, **kwargs)

  @property
  def _cmd(self):
    # Run NERClassifierCombiner on the model list instead of a single classifier.
    return ['edu.stanford.nlp.ie.NERClassifierCombiner',
            '-ner.model',
            self.stanford_ner_models,
            '-textFile',
            self._input_file_path,
            '-outputFormat',
            self._FORMAT,
            '-tokenizerFactory',
            'edu.stanford.nlp.process.WhitespaceTokenizer',
            '-tokenizerOptions',
            '\"tokenizeNLs=false\"']

classifier_path1 = "classifiers/english.conll.4class.distsim.crf.ser.gz"
classifier_path2 = "classifiers/english.muc.7class.distsim.crf.ser.gz"

ner_jar_path = "stanford-ner.jar"

# The first positional argument is not used by _cmd but must be a valid file.
st = NERComboTagger(classifier_path1, ner_jar_path,
                    stanford_ner_models=classifier_path1 + "," + classifier_path2)

print(st.tag("Barack Obama is from Hawaii .".split(" ")))

Note that the major change in the subclass is what _cmd returns.

Also note that I ran this in the unzipped folder stanford-ner-2015-04-20, so the paths are relative to that folder.
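If you would rather not depend on the current working directory, one option is to build the paths from a base folder. This is only a sketch: the helper name model_list and the NER_HOME variable are assumptions for illustration, and NER_HOME should point at wherever you unzipped Stanford NER.

```python
import os


def model_list(base_dir, model_files):
    """Join model paths under base_dir into one comma-separated string."""
    return ",".join(os.path.join(base_dir, name) for name in model_files)


# Assumption: adjust NER_HOME to the location of your unzipped folder.
NER_HOME = "stanford-ner-2015-04-20"

models = model_list(NER_HOME, [
    os.path.join("classifiers", "english.conll.4class.distsim.crf.ser.gz"),
    os.path.join("classifiers", "english.muc.7class.distsim.crf.ser.gz"),
])
ner_jar_path = os.path.join(NER_HOME, "stanford-ner.jar")

print(models)
```

The resulting string can be passed as stanford_ner_models, with ner_jar_path as the jar argument. Because the models are comma-separated, this assumes none of the paths contains a comma.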

I get this output:

[('Barack', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'), ('from', 'O'), ('Hawaii', 'LOCATION'), ('.', 'O')]
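For querying a database you will probably want whole entities rather than per-token tags. The following is a rough sketch of one way to group consecutive tokens that share a tag; group_entities is a hypothetical helper, not part of NLTK or Stanford NER, and it will merge two adjacent entities of the same type into one.

```python
from itertools import groupby


def group_entities(tagged, outside="O"):
    """Collapse runs of identically tagged tokens into (text, tag) pairs."""
    entities = []
    for tag, run in groupby(tagged, key=lambda pair: pair[1]):
        if tag != outside:
            entities.append((" ".join(token for token, _ in run), tag))
    return entities


tagged = [('Barack', 'PERSON'), ('Obama', 'PERSON'), ('is', 'O'),
          ('from', 'O'), ('Hawaii', 'LOCATION'), ('.', 'O')]

print(group_entities(tagged))
# [('Barack Obama', 'PERSON'), ('Hawaii', 'LOCATION')]
```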

Here is a link to the Stanford NER page:

http://nlp.stanford.edu/software/CRF-NER.shtml

Please let me know if you need any more help or if there are any errors in my code. I may have made a mistake while transcribing, but it works on my laptop!