Mallet CRF Sequence Classification Training Data Format


I am trying to train a CRF sequence model using the Mallet library, but I am missing some important information. I found an example in the library itself at https://github.com/mimno/Mallet/blob/master/src/cc/mallet/examples/TrainCRF.java. However, the example does not state the format of the input training data, so I do not know how to recreate it.

Mallet does have a data import example at http://mallet.cs.umass.edu/import-devel.php, but that example seems to be for document classification, not for CRF sequence models, which is my use case.

I tried putting the input training data in the form used at http://mallet.cs.umass.edu/sequences.php i.e.

Bill CAPITALIZED noun
slept non-noun
here LOWERCASE STOPWORD non-noun

and test data in the form

CAPITAL Al
        slept
        here

However, based on the output logs, this does not seem to be the correct format. For example, one line in the log is INFO: testing label slept P � R 0 F1 �, but slept is not a label; the labels should be noun or non-noun.

So if someone could tell me what format the training data should be in that would be great.


Solution

  • The code sample you link to has the line that refers to the training file commented out. Is it possible your code is trying to train on the test file? That would cause slept to look like a label since it's at the end of the line, and would explain the error.

    For the record, I tried the example using the test data you gave above (using the command line, not the code sample) and it worked, so the test/train format seems to be OK.
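
To see why training on the test file would produce slept as a label, here is a minimal sketch of how a line of labelled data is read: the last whitespace-separated token is taken as the label and everything before it as features. This is plain Java and does not call Mallet; the class name TrainingLineSketch and the methods labelOf and featuresOf are hypothetical names made up for illustration, and Mallet's actual parsing may differ in its details.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mallet: it only mimics the
// "features ... label" reading of one line of labelled data.
public class TrainingLineSketch {

    // The last whitespace-separated token is treated as the label.
    static String labelOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return tokens[tokens.length - 1];
    }

    // Every token before the last one is treated as a feature.
    static List<String> featuresOf(String line) {
        String[] tokens = line.trim().split("\\s+");
        return new ArrayList<>(Arrays.asList(tokens).subList(0, tokens.length - 1));
    }

    public static void main(String[] args) {
        // Training and test lines copied from the question.
        String[] train = {
            "Bill CAPITALIZED noun",
            "slept non-noun",
            "here LOWERCASE STOPWORD non-noun"
        };
        String[] test = {
            "CAPITAL Al",
            "        slept",
            "        here"
        };

        System.out.println("Training file read as labelled data:");
        for (String line : train) {
            System.out.println("  features=" + featuresOf(line) + " label=" + labelOf(line));
        }

        System.out.println("Test file read as labelled data by mistake:");
        for (String line : test) {
            System.out.println("  features=" + featuresOf(line) + " label=" + labelOf(line));
        }
    }
}
```

Read this way, the training lines yield the labels noun and non-noun, while the test lines yield Al, slept and here, which matches the "testing label slept" line in the log.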