java weka

WEKA: Classify instances with a deserialized model


I used Weka Explorer:

  • Loaded the arff file
  • Applied StringToWordVector filter
  • Selected IBk as the best classifier
  • Generated/Saved my_model.model binary

In my Java code I deserialize the model:

    URL curl = ClassUtility.findClasspathResource( "models/my_model.model" );
    final Classifier cls = (Classifier) weka.core.SerializationHelper.read( curl.openConnection().getInputStream() );

Now I have the classifier, but I also need the information about the filter. Where I am stuck: how do I prepare an instance so that my deserialized model can classify it, i.e. how do I apply the filter before classification? The raw instance that I have to classify has a text field with tokens in it, and the filter is supposed to transform that field into a list of new attributes.

I even tried a FilteredClassifier, setting its classifier to the deserialized one and its filter to a manually created instance of StringToWordVector:

    final StringToWordVector filter = new StringToWordVector();
    // Each option token must be its own array element ("-P" and "x_" separately).
    filter.setOptions(new String[]{"-C", "-P", "x_", "-L"});
    FilteredClassifier fcls = new FilteredClassifier();
    fcls.setFilter(filter);
    fcls.setClassifier(cls);

The above does not work either. It throws the exception:

    Exception in thread "main" java.lang.NullPointerException: No output instance format defined

What I am trying to avoid is doing the training in the Java code. Training can be very slow, I may end up with multiple classifiers to train (using different algorithms as well), and I want my app to start fast.


Solution

  • Your problem is that your model knows nothing about what the filter did to the data. The StringToWordVector filter transforms the data, and how it does so depends on the input (training) data: the word attributes it creates come from the training text. A model trained on this transformed data set will only work on data that underwent exactly the same transformation. To guarantee this, the filter needs to be part of your model.

    Using a FilteredClassifier is the correct idea, but you have to use it from the beginning:

    • Load the ARFF file
    • Select FilteredClassifier as classifier
    • Select StringToWordVector as filter for it
    • Select IBk as classifier for the FilteredClassifier
    • Generate/Save the model to my_model.model

    The trained and serialized model will then also contain the initialized filter, including the information on how to transform data.
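
On the Java side, a model saved this way can be handed raw, unfiltered instances. The following is a minimal sketch, not a definitive implementation. It assumes Weka 3.7 or later on the classpath. To stay self-contained it trains a tiny FilteredClassifier in memory as a stand-in for the Explorer steps and round-trips it through a byte array instead of my_model.model. The class name FilteredModelSketch, the attribute names "text" and "label", the class values and the sample sentences are all hypothetical; in a real application the header must match the training ARFF (same attribute names, types, order and class values).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import weka.classifiers.Classifier;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.FilteredClassifier;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.SerializationHelper;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class FilteredModelSketch {

    // Hypothetical header: one raw string attribute plus a nominal class.
    // It has to mirror the header of the ARFF file used for training.
    static Instances header() {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("text", (List<String>) null)); // null value list => string attribute
        atts.add(new Attribute("label", Arrays.asList("spam", "ham")));
        Instances data = new Instances("docs", atts, 0);
        data.setClassIndex(1);
        return data;
    }

    // Builds an unfiltered instance; a null label leaves the class missing.
    static Instance raw(Instances header, String text, String label) {
        Instance inst = new DenseInstance(header.numAttributes());
        inst.setDataset(header);
        inst.setValue(header.attribute("text"), text);
        if (label == null) {
            inst.setClassMissing();
        } else {
            inst.setClassValue(label);
        }
        return inst;
    }

    // Stand-in for the Explorer steps: FilteredClassifier = StringToWordVector + IBk.
    public static byte[] trainAndSerialize() throws Exception {
        Instances train = header();
        train.add(raw(train, "cheap pills offer", "spam"));
        train.add(raw(train, "win cash prize now", "spam"));
        train.add(raw(train, "meeting agenda attached", "ham"));
        train.add(raw(train, "lunch at noon tomorrow", "ham"));

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new IBk());
        fc.buildClassifier(train); // also initializes the filter on the training text

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        SerializationHelper.write(out, fc);
        return out.toByteArray();
    }

    // Deserializes the model and classifies raw text; no manual filtering is needed
    // because the FilteredClassifier applies its stored filter internally.
    public static String predict(byte[] model, String text) throws Exception {
        Classifier cls = (Classifier) SerializationHelper.read(new ByteArrayInputStream(model));
        Instances header = header();
        Instance inst = raw(header, text, null);
        double idx = cls.classifyInstance(inst);
        return header.classAttribute().value((int) idx);
    }

    public static void main(String[] args) throws Exception {
        byte[] model = trainAndSerialize();
        System.out.println(predict(model, "cheap cash offer"));
    }
}
```

In the application from the question, only the predict part would be needed: the input stream would come from the classpath resource rather than from a byte array, so no training happens at startup.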