Tags: java, machine-learning, classification, maxent

Creating training data for a Maxent classifier in Java


I am trying to create a Java implementation of a maxent classifier. I need to classify sentences into n different classes.

I had a look at ColumnDataClassifier in the Stanford maxent classifier, but I cannot work out how to create the training data. I need training data that includes the POS tags for the words of each sentence, so that the classifier can use features such as the previous word, the next word, etc.

I am looking for training data in which each sentence is POS-tagged and labelled with its class. Example:

My/(POS) name/(POS) is/(POS) XYZ/(POS) CLASS

Any help will be appreciated.


Solution

  • If I understand it correctly, you are trying to treat sentences as a set of POS tags.

    In your example, the sentence "My name is XYZ" would be represented as the set (PRP$, NN, VBZ, NNP). Every sentence would then be a binary vector of length 37 (36 possible POS tags according to this page, plus the CLASS outcome feature for the whole sentence).

    This can be encoded for OpenNLP Maxent as follows:

    PRP$=1 NN=1 VBZ=1 NNP=1 CLASS=SomeClassOfYours1
    

    or simply:

    PRP$ NN VBZ NNP CLASS=SomeClassOfYours1
    

    (For a working code snippet, see my answer here: Training models using openNLP maxent)
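
    As a rough illustration of the encoding above, here is a minimal sketch that turns a tagged sentence into one such line. It assumes the input uses the word/TAG notation from the question; the class name PosBagEncoder and the method encode are made up for this example and are not part of OpenNLP. Repeated tags are collapsed, which is one reading of "set of POS tags".

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper (not an OpenNLP class): builds a "TAG TAG ... CLASS=label" line.
final class PosBagEncoder {

    // Converts e.g. "My/PRP$ name/NN" plus a label into "PRP$ NN CLASS=label".
    static String encode(String taggedSentence, String label) {
        // LinkedHashSet gives set semantics while keeping first-seen order.
        Set<String> tags = new LinkedHashSet<>();
        for (String token : taggedSentence.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash < 0 || slash == token.length() - 1) {
                throw new IllegalArgumentException("Expected word/TAG but got: " + token);
            }
            tags.add(token.substring(slash + 1));
        }
        return String.join(" ", tags) + " CLASS=" + label;
    }

    public static void main(String[] args) {
        System.out.println(encode("My/PRP$ name/NN is/VBZ XYZ/NNP", "SomeClassOfYours1"));
    }
}
```

    The POS tags themselves still have to come from a tagger; this sketch only handles the formatting step.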

    Some more sample data would be:

    1. "By 1978, Radio City had lost its glamour, and the owners of Rockefeller Center decided to demolish the aging hall."
    2. "In time he was entirely forgotten, many of his buildings were demolished, others insensitively altered."
    3. "As soon as she moved out, the mobile home was demolished, the suit said."
    4. ...

    This would yield samples:

    IN CD NNP VBD VBN PRP$ NN CC DT NNS IN TO VB VBG CLASS=SomeClassOfYours2
    IN NN PRP VBD RB VBN JJ IN PRP$ NNS CLASS=SomeClassOfYours3
    IN RB PRP VBD RP DT JJ NN VBN NN CLASS=SomeClassOfYours2
    ...
    

    However, I don't expect that such a classification yields good results. It would be better to make use of other structural features of a sentence, such as the parse tree or dependency tree that can be obtained using e.g. Stanford parser.

    Edited on 28.3.2016: You can also use the whole sentence as a training sample. However, be aware that:

    - two sentences might contain same words but have different meaning
    - there is a pretty high chance of overfitting
    - you should use short sentences
    - you need a huge training set

    According to your example, I would encode the training samples as follows:

    class=CLASS My_PRP name_NN is_VBZ XYZ_NNP
    ...
    

    Notice that the outcome variable comes as the first element on each line.
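
    If your annotated data looks like the line in the question (word/TAG tokens followed by the class), a small converter can produce this layout. The following is only a sketch under that assumption; TrainingLineConverter and toTrainingLine are hypothetical names, and the "class=" prefix simply mirrors the sample above.

```java
// Hypothetical helper (not an OpenNLP class): rewrites
// "My/PRP name/NN is/VBZ XYZ/NNP CLASS" as "class=CLASS My_PRP name_NN is_VBZ XYZ_NNP".
final class TrainingLineConverter {

    static String toTrainingLine(String annotated) {
        String[] tokens = annotated.trim().split("\\s+");
        if (tokens.length < 2) {
            throw new IllegalArgumentException("Need at least one word/TAG token and a class label");
        }
        // The last token is taken to be the class label; it becomes the first element.
        StringBuilder line = new StringBuilder("class=").append(tokens[tokens.length - 1]);
        for (int i = 0; i < tokens.length - 1; i++) {
            String token = tokens[i];
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) {
                throw new IllegalArgumentException("Expected word/TAG but got: " + token);
            }
            // Join word and tag with an underscore so each pair is a single feature.
            line.append(' ')
                .append(token.substring(0, slash))
                .append('_')
                .append(token.substring(slash + 1));
        }
        return line.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTrainingLine("My/PRP name/NN is/VBZ XYZ/NNP CLASS"));
    }
}
```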

    Here is a fully working minimal example using opennlp-maxent-3.0.3.jar.


    package my.maxent;
    
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.zip.GZIPInputStream;
    
    import opennlp.maxent.GIS;
    import opennlp.maxent.io.GISModelReader;
    import opennlp.maxent.io.SuffixSensitiveGISModelWriter;
    import opennlp.model.AbstractModel;
    import opennlp.model.AbstractModelWriter;
    import opennlp.model.DataIndexer;
    import opennlp.model.DataReader;
    import opennlp.model.FileEventStream;
    import opennlp.model.MaxentModel;
    import opennlp.model.OnePassDataIndexer;
    import opennlp.model.PlainTextFileDataReader;
    
    public class MaxentTest {
    
    
        public static void main(String[] args) throws IOException {
    
            String trainingFileName = "training-file.txt";
            String modelFileName = "trained-model.maxent.gz";
    
            // Training a model from data stored in a file.
            // The training file contains one training sample per line.
            DataIndexer indexer = new OnePassDataIndexer( new FileEventStream(trainingFileName)); 
            MaxentModel trainedMaxentModel = GIS.trainModel(100, indexer); // 100 iterations
    
            // Storing the trained model into a file for later use (gzipped)
            File outFile = new File(modelFileName);
            AbstractModelWriter writer = new SuffixSensitiveGISModelWriter((AbstractModel) trainedMaxentModel, outFile);
            writer.persist();
    
            // Loading the gzipped model from a file (the streams are closed once the model is read)
            MaxentModel loadedMaxentModel;
            try (InputStream decodedInputStream = new GZIPInputStream(new FileInputStream(modelFileName))) {
                DataReader modelReader = new PlainTextFileDataReader(decodedInputStream);
                loadedMaxentModel = new GISModelReader(modelReader).getModel();
            }
    
            // Now predicting the outcome using the loaded model
            String[] context = {"is_VBZ", "Gaby_NNP"};
            double[] outcomeProbs = loadedMaxentModel.eval(context);
    
            String outcome = loadedMaxentModel.getBestOutcome(outcomeProbs);
            System.out.println("=======================================");
            System.out.println(outcome);
            System.out.println("=======================================");
        }
    
    }
    

    And some dummy training data (stored as training-file.txt):

    class=Male      My_PRP name_NN is_VBZ John_NNP
    class=Male      My_PRP name_NN is_VBZ Peter_NNP
    class=Female    My_PRP name_NN is_VBZ Anna_NNP
    class=Female    My_PRP name_NN is_VBZ Gaby_NNP
    

    This yields the following output:

    Indexing events using cutoff of 0
    Computing event counts...  done. 4 events
    Indexing...  done.
    Sorting and merging events... done. Reduced 4 events to 4.
    Done indexing.
    Incorporating indexed data for training...  
    done.
        Number of Event Tokens: 4
            Number of Outcomes: 2
          Number of Predicates: 7
    ...done.
    Computing model parameters ...
    Performing 100 iterations.
      1:  ... loglikelihood=-2.772588722239781  0.5
      2:  ... loglikelihood=-2.4410105407571203 1.0
          ...
     99:  ... loglikelihood=-0.16111520541752372    1.0
    100:  ... loglikelihood=-0.15953272940719138    1.0
    =======================================
    class=Female
    =======================================
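
    A note on reading the log: the value reported for iteration 1 is what you would get from a model that assigns equal probability to both outcomes for each of the 4 events, i.e. 4 * ln(0.5), which is about -2.7726. The trailing column appears to be the accuracy on the training data (0.5 at first, 1.0 afterwards). This is my reading of the output rather than something taken from the OpenNLP documentation. The arithmetic can be checked with a few lines; the class and method names are hypothetical.

```java
// Hypothetical check: log-likelihood of a uniform model over the training events.
final class InitialLogLikelihood {

    // Each event contributes ln(1 / numOutcomes) when all outcomes are equally likely.
    static double uniformLogLikelihood(int numEvents, int numOutcomes) {
        return numEvents * Math.log(1.0 / numOutcomes);
    }

    public static void main(String[] args) {
        // 4 events and 2 outcomes, as reported in the training log above.
        System.out.println(uniformLogLikelihood(4, 2));
    }
}
```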