Training n-gram NER with Stanford NLP

Recently I have been trying to train n-gram entities with Stanford CoreNLP. I have followed these tutorials -

With this, I am able to specify only unigram tokens and the class each one belongs to. Can anyone guide me on how to extend this to n-grams? I am trying to extract known entities, such as movie names, from a chat data set.

Please let me know if I have misinterpreted the Stanford tutorials and the same approach can be used for n-gram training.

What I am stuck on is the following property:

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

Here the first column is the word (unigram) and the second column is the entity, for example:

I   O
Emma    PERS
Woodhouse   PERS

If I need to train known single-word entities (say movie names) such as Hulk or Titanic as movies, this approach makes it easy. But if I need to train multi-word names such as "I know what you did last summer" or "Baby's day out", what is the best approach?


  • It has been a long wait for an answer here. I have not been able to figure out how to do this with Stanford CoreNLP, but I got it working with the LingPipe NLP libraries instead. I am posting the answer here because I think someone else could benefit from it.

    Please check the LingPipe licensing before diving into an implementation, whether you are a developer, a researcher, or anything else.

    LingPipe provides various NER methods:

    1) Dictionary Based NER

    2) Statistical NER (HMM Based)

    3) Rule Based NER etc.

    I have used both the dictionary-based and the statistical approaches.

    The first is a direct lookup method; the second requires training a model on annotated data.

    An example of dictionary-based NER can be found here
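
    As a rough illustration of the direct lookup idea only: the sketch below is plain Java and does not use LingPipe's dictionary chunker API. The class name DictionaryLookup, the Match holder, and the sample dictionary entries are all made up for this example. It scans the text left to right and, at each position, keeps the longest dictionary phrase that matches on word boundaries, so multi-word names are found as a single entity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of dictionary lookup; this is not LingPipe code.
final class DictionaryLookup {

    /** A matched span: character offsets into the input plus the entity type. */
    static final class Match {
        final int start;
        final int end;
        final String type;

        Match(int start, int end, String type) {
            this.start = start;
            this.end = end;
            this.type = type;
        }
    }

    /** True if index lies outside the text or points at a non-alphanumeric character. */
    static boolean isBoundary(String text, int index) {
        return index < 0 || index >= text.length()
                || !Character.isLetterOrDigit(text.charAt(index));
    }

    /** Finds non-overlapping, case-insensitive, longest-first matches of dictionary phrases. */
    static List<Match> find(String text, Map<String, String> dictionary) {
        List<Match> matches = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String best = null;
            for (String phrase : dictionary.keySet()) {
                boolean hit = !phrase.isEmpty()
                        && text.regionMatches(true, i, phrase, 0, phrase.length())
                        && isBoundary(text, i - 1)
                        && isBoundary(text, i + phrase.length());
                if (hit && (best == null || phrase.length() > best.length())) {
                    best = phrase;
                }
            }
            if (best == null) {
                i++; // nothing starts here, move on one character
            } else {
                matches.add(new Match(i, i + best.length(), dictionary.get(best)));
                i += best.length(); // skip past the match so spans never overlap
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Sample entries are assumptions for the demo, phrase -> entity type.
        Map<String, String> movies = Map.of(
                "baby's day out", "MOVIE",
                "hulk", "MOVIE");
        String text = "I watched Baby's day out and Hulk";
        for (Match m : find(text, movies)) {
            System.out.println(text.substring(m.start, m.end) + " >> " + m.type);
        }
    }
}
```

    A lookup like this only finds names that are already in the dictionary, which is why I also tried the statistical approach.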

    The statistical approach requires a training file. I used a file in the following format:

    <s> data line with the <ENAMEX TYPE="myentity">entity1</ENAMEX>  to be trained</s>
    <s> with the <ENAMEX TYPE="myentity">entity2</ENAMEX>  annotated </s>
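
    Writing these annotations by hand gets tedious when the entities are already known (for example a list of movie names). The sketch below is one possible way to generate lines in the format above from raw sentences; it is plain Java, not part of LingPipe, and the class name EnamexAnnotator, the method name annotate, and the "MOVIE" type are assumptions made for the example. A multi-word name ends up inside a single ENAMEX element, which is how an n-gram entity is expressed in this format.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper for building training lines; this is not LingPipe code.
final class EnamexAnnotator {

    /**
     * Wraps each occurrence of a known phrase in an ENAMEX tag and the whole
     * sentence in <s>...</s>. Longer phrases are tried first so that
     * "The Incredible Hulk" wins over "Hulk".
     * Limitations of this sketch: matching is case-sensitive, ignores word
     * boundaries, and does not escape characters such as & or < in the text.
     */
    static String annotate(String sentence, List<String> phrases, String type) {
        List<String> byLength = new ArrayList<>(phrases);
        byLength.sort(Comparator.comparingInt(String::length).reversed());

        StringBuilder out = new StringBuilder("<s>");
        int i = 0;
        while (i < sentence.length()) {
            String match = null;
            for (String phrase : byLength) {
                if (!phrase.isEmpty() && sentence.startsWith(phrase, i)) {
                    match = phrase; // first hit is the longest, list is sorted
                    break;
                }
            }
            if (match == null) {
                out.append(sentence.charAt(i));
                i++;
            } else {
                out.append("<ENAMEX TYPE=\"").append(type).append("\">")
                   .append(match)
                   .append("</ENAMEX>");
                i += match.length();
            }
        }
        return out.append("</s>").toString();
    }

    public static void main(String[] args) {
        // Sample movie names and sentence are assumptions for the demo.
        List<String> movies = List.of("I know what you did last summer", "Hulk");
        System.out.println(annotate(
                "have you seen I know what you did last summer yet", movies, "MOVIE"));
    }
}
```

    The generated lines can then be written to the annotated input file, one sentence per line.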

    I then used the following code to train the entities.

    import java.io.File;
    import java.io.IOException;
    import com.aliasi.chunk.CharLmHmmChunker;
    import com.aliasi.corpus.parsers.Muc6ChunkParser;
    import com.aliasi.hmm.HmmCharLmEstimator;
    import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
    import com.aliasi.tokenizer.TokenizerFactory;
    import com.aliasi.util.AbstractExternalizable;
    public class TrainEntities {
        static final int MAX_N_GRAM = 50;
        static final int NUM_CHARS = 300;
        static final double LM_INTERPOLATION = MAX_N_GRAM; // default behavior
        public static void main(String[] args) throws IOException {
            File corpusFile = new File("inputfile.txt"); // my annotated file
            File modelFile = new File("outputmodelfile.model");
            System.out.println("Setting up Chunker Estimator");
            TokenizerFactory factory
                = IndoEuropeanTokenizerFactory.INSTANCE;
            HmmCharLmEstimator hmmEstimator
                = new HmmCharLmEstimator(MAX_N_GRAM,NUM_CHARS,LM_INTERPOLATION);
            CharLmHmmChunker chunkerEstimator
                = new CharLmHmmChunker(factory,hmmEstimator);
            System.out.println("Setting up Data Parser");
            Muc6ChunkParser parser = new Muc6ChunkParser();
            parser.setHandler(chunkerEstimator);
            System.out.println("Training with Data from File=" + corpusFile);
            // Parse the annotated corpus; each chunking is passed to the estimator
            parser.parse(corpusFile);
            System.out.println("Compiling and Writing Model to File=" + modelFile);
            // Compile the trained estimator and serialize it to the model file
            AbstractExternalizable.compileTo(chunkerEstimator, modelFile);
        }
    }

    And to test the NER I used the following class

    import java.io.File;
    import java.util.Set;
    import com.aliasi.chunk.Chunk;
    import com.aliasi.chunk.Chunker;
    import com.aliasi.chunk.Chunking;
    import com.aliasi.util.AbstractExternalizable;
    public class Recognition {
        public static void main(String[] args) throws Exception {
            File modelFile = new File("outputmodelfile.model");
            // Load the compiled model written by the training class
            Chunker chunker = (Chunker) AbstractExternalizable
                    .readObject(modelFile);
            String testString = "my test string";
            Chunking chunking = chunker.chunk(testString);
            Set<Chunk> test = chunking.chunkSet();
            // Print each recognized chunk together with its entity type
            for (Chunk c : test) {
                System.out.println(testString + " : "
                        + testString.substring(c.start(), c.end()) + " >> "
                        + c.type());
            }
        }
    }

    Code Courtesy : Google :)