nlp stanford-nlp named-entity-recognition

How to suppress unmatched words in Stanford NER classifiers?

I am new to Stanford NLP and NER, and I am trying to train a custom classifier on a data set of currencies and countries.

My training data in training-data-currency.tsv looks like -

USD CURRENCY
GBP CURRENCY

And my training data in training-data-countries.tsv looks like -

USA COUNTRY
UK  COUNTRY

And the classifier properties look like -

trainFileList = classifiers/training-data-currency.tsv,classifiers/training-data-countries.tsv
ner.model=classifiers/english.conll.4class.distsim.crf.ser.gz,classifiers/english.muc.7class.distsim.crf.ser.gz,classifiers/english.all.3class.distsim.crf.ser.gz
serializeTo = classifiers/my-classification-model.ser.gz
map = word=0,answer=1

useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC

Java code to find the categories is -

LinkedHashMap<String, LinkedHashSet<String>> map = new LinkedHashMap<String, LinkedHashSet<String>>();
NERClassifierCombiner classifier = null;
try {
    classifier = new NERClassifierCombiner(true, true, 
            "C:\\Users\\perso\\Downloads\\stanford-ner-2015-04-20\\stanford-ner-2015-04-20\\classifiers\\my-classification-model.ser.gz"
            );
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
List<List<CoreLabel>> classify = classifier.classify("Zambia");
for (List<CoreLabel> coreLabels : classify) {
    for (CoreLabel coreLabel : coreLabels) {

        String word = coreLabel.word();
        String category = coreLabel
                .get(CoreAnnotations.AnswerAnnotation.class);
        if (!"O".equals(category)) {
            if (map.containsKey(category)) {
                map.get(category).add(word);
            } else {
                LinkedHashSet<String> temp = new LinkedHashSet<String>();
                temp.add(word);
                map.put(category, temp);
            }
            System.out.println(word + ":" + category);
        }

    }

}

When I run the above code with "USD" or "UK" as input, I get the expected result, "CURRENCY" or "COUNTRY". But when I input something like "Russia", the returned value is "CURRENCY", which comes from the first training file in the properties. I expected 'O' to be returned for values that are not present in my training data.

How can I achieve this behavior? Any pointers where I am going wrong would be really helpful.

Solution

Hi, I'll try to help out!

So it sounds to me like you have a list of strings that should be called "CURRENCY", and you have a list of strings that should be called "COUNTRY", etc...

And you want something to tag strings based off of your list. So when you see "RUSSIA", you want it to be tagged "COUNTRY", when you see "USD", you want it to be tagged "CURRENCY".

I think these tools will be more helpful for you (particularly the first one):

http://nlp.stanford.edu/software/regexner/

http://nlp.stanford.edu/software/tokensregex.shtml

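The statistical classifiers that NERClassifierCombiner loads are designed to be trained on large volumes of tagged sentences. They look at a variety of features, including capitalization and the surrounding words, to make a guess about a given word's NER label. Your training files contain only isolated words, all labeled CURRENCY or COUNTRY, with no sentences and no examples labeled O, so the model has very little basis for leaving an unseen word like "Russia" untagged.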

But it sounds to me like, in your case, you just want to explicitly tag certain sequences based off of your pre-defined list. So I would explore the links I provided above.
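To make the idea of list-based tagging concrete, here is a minimal sketch in plain Java. It is not RegexNER and uses no Stanford classes. It is only a stand-in that shows the behavior you are after: exact matches get the label from your list, and everything else gets "O". The class name ListTagger, the hard-coded entries, and the whitespace tokenization are all assumptions made for illustration; a real setup would load the lists from your files and use a proper tokenizer.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;

public class ListTagger {
    // Hypothetical in-memory list; in practice this would be loaded from the two .tsv files.
    private static final Map<String, String> LABELS = new LinkedHashMap<>();
    static {
        LABELS.put("USD", "CURRENCY");
        LABELS.put("GBP", "CURRENCY");
        LABELS.put("USA", "COUNTRY");
        LABELS.put("UK", "COUNTRY");
    }

    /** Returns the label for an exact, case-sensitive match, or "O" if the token is not listed. */
    public static String tag(String token) {
        return LABELS.getOrDefault(token, "O");
    }

    /** Groups the listed tokens of a whitespace-separated text by label, skipping "O". */
    public static Map<String, LinkedHashSet<String>> group(String text) {
        Map<String, LinkedHashSet<String>> byLabel = new LinkedHashMap<>();
        for (String token : text.trim().split("\\s+")) {
            String label = tag(token);
            if (!"O".equals(label)) {
                byLabel.computeIfAbsent(label, k -> new LinkedHashSet<>()).add(token);
            }
        }
        return byLabel;
    }

    public static void main(String[] args) {
        // "Russia" is not in the list, so it is tagged "O" and left out of the result.
        System.out.println(group("USD and GBP are used in USA , UK and Russia"));
        // prints {CURRENCY=[USD, GBP], COUNTRY=[USA, UK]}
    }
}
```

With a lookup like this, "Russia" and "Zambia" come back as "O" simply because they are not in the list, which is the same kind of explicit matching that the tools linked above are built for.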

Please let me know if you need any more help and I will be happy to follow up!