java weka

Weka J48 Classification Test Instances and Return Value


I have an Instances object that serves as the training set for a J48 tree classifier, and populating it works. Now I need to classify new data. Say the training set has 24 attributes. What is the most common way to represent the instances in the query set?

Either each instance has 23 attributes (the training-set schema without the label value), or

the query set uses the same schema as the training set, with the last attribute defined as the label, and the classifier somehow ignores the label when it runs (I am not certain about this one)?

My second question concerns the result of the classification:

The value returned by j48.classifyInstance() is a double and, according to the API, it is the identifier of the class on the testset. However, trainset.class.class_name_from_int takes an int as its parameter. Does the double returned by classifyInstance only take the values 0, 1, ..., numClasses-1, so that a cast to int is enough, or do I need to apply a transformation such as ceil or floor?


Solution

  • For your first question: when you have a labelled test set, I think the second approach is better, because you can also evaluate your model when the test instances carry labels. Omitting the labels is not necessary, as the model does not use them during classification.

    For your second question, casting the double returned by classifyInstance to int is enough:

    String prediction = train.classAttribute().value((int) classifier.classifyInstance(testInstance));
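
    The cast above can be wrapped in a small guard. The following is a minimal sketch in plain Java, not Weka code: the class PredictionLabel, the helper labelFor and the list of labels are hypothetical stand-ins for train.classAttribute() and its nominal values. It assumes the classifier may return NaN (Weka's missing value) when it makes no prediction; a bare (int) cast turns NaN into 0, which would silently select the first class.

```java
import java.util.Arrays;
import java.util.List;

public class PredictionLabel {

    /**
     * Maps the double returned by a classifier to a class label.
     * "labels" is a hypothetical stand-in for the nominal values of the
     * class attribute, listed in index order.
     */
    public static String labelFor(double predicted, List<String> labels) {
        // Assumption: NaN means "no prediction". In Java, (int) Double.NaN is 0,
        // so without this check the first class would be reported instead.
        if (Double.isNaN(predicted)) {
            throw new IllegalArgumentException("classifier made no prediction");
        }
        // For whole-number values such as 0.0, 1.0, 2.0 the cast is exact,
        // so no ceil or floor is needed.
        int index = (int) predicted;
        if (index < 0 || index >= labels.size()) {
            throw new IndexOutOfBoundsException("class index out of range: " + index);
        }
        return labels.get(index);
    }

    public static void main(String[] args) {
        // Hypothetical class labels; with Weka these would come from the class attribute.
        List<String> labels = Arrays.asList("no", "yes", "maybe");
        double predicted = 1.0; // stands in for classifier.classifyInstance(testInstance)
        System.out.println(labelFor(predicted, labels)); // prints "yes"
    }
}
```

    With Weka on the classpath, the same check could be applied to the value of classifier.classifyInstance(testInstance) before it is passed to train.classAttribute().value(...).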