Python weka wrapper: classify instances from another file

I'm using Ubuntu 15.10, Python 2.7, and have the current install of the python weka-wrapper package.

I'm doing the following:

(1) Training a classifier on data I load from a .csv file.
(2) Loading a second set of data from another .csv file. This data has the same header (designating the features) as the file used to train the original classifier.
(3) Attempting to use the trained classifier to classify the data loaded from this second file. What I really want is the probability of each instance belonging to a certain class (an aside I come back to below).

Here is my code that takes the trained classifier and (second) file name as input:

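    from weka.core.converters import Loader

    def classifyData(classifier, datFile):
        # Load the second CSV file and mark the last attribute as the class.
        loader = Loader(classname="weka.core.converters.CSVLoader")
        data = loader.load_file(datFile)
        data.class_is_last()

        preds = []
        dists = []

        iCount = 0
        for inst in data:
            iCount += 1
            # Predicted class (as an internal label index) for this instance.
            pred = classifier.classify_instance(inst)
            # Class distribution for this instance.
            dist = classifier.distribution_for_instance(inst)
            preds.append(pred)
            dists.append(dist)

        return preds, dists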

Note: the class variable (the last feature in the second data file) is given as "?", representing data for which I do not have a label.

Quick aside question: Does the dist variable contain the probability of the class? If not, how would I get this information?

Running this function produces the following error:

    Exception in thread "Thread-0" java.lang.ArrayIndexOutOfBoundsException: 1
        at weka.classifiers.meta.Bagging.distributionForInstance(Bagging.java:816)
        at weka.classifiers.AbstractClassifier.classifyInstance(AbstractClassifier.java:173)
    Traceback (most recent call last):
      File "parsFunc.py", line 33, in main
        initProb = classifyData(classifTrain,ttDir+"temp.csv")
      File "parsFunc.py", line 136, in classifyData
        pred = classifier.classify_instance(inst)
      File "/usr/local/lib/python2.7/dist-packages/weka/classifiers.py", line 105, in classify_instance
        return self.__classify(inst.jobject)
      File "/usr/local/lib/python2.7/dist-packages/javabridge/jutil.py", line 852, in fn
        raise JavaException(x)
    javabridge.jutil.JavaException: 1

I'm not exactly sure what's going wrong here. I know the second file has the same number of instances as the file used to train the model, and that the header is the same. Any help would be appreciated!

Solution

javabridge.jutil.JavaException: 1 is not very helpful, I know, but it points to the ArrayIndexOutOfBoundsException shown at the top of the output. Weka requires training and test sets (or data that you want to make predictions for) to have the exact same format, enforcing not only the order of the attributes but also the order of the labels (in the case of nominal attributes). The latter is necessary because Weka stores a label internally as a numeric index, so the same internal value, e.g., 1, means something different for labels {yes,no} than for {no,yes}.
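
To make the label-order point concrete, here is a minimal sketch in plain Python. It is only a simplified model of how a nominal label maps to an index, not Weka's actual code, and `label_index` is a name made up for this illustration:

```python
# Simplified model (not Weka code): a nominal label is stored as its
# position in the attribute's declared label list.
def label_index(labels, label):
    """Return the index a label would get for this label order."""
    return labels.index(label)

train_labels = ["yes", "no"]
test_labels = ["no", "yes"]   # same labels, different order

# Index stored for "no" when the training order is used.
stored = label_index(train_labels, "no")

# The same index read back against the other order gives a different label.
decoded = test_labels[stored]
```

Here `stored` is 1 and `decoded` is "yes", which is why both files need the same label order.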

When using CSV files, neither the number of labels nor their order can be guaranteed, as the CSVLoader uses whatever strings it encounters as labels. In your case, there are no labels at all in your class attribute column (they're all denoted as missing), which most likely causes the exception you encountered.
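
As a rough illustration of why the labels depend on the file contents, the sketch below collects labels from a column in order of first appearance. That ordering and the helper name `labels_from_column` are assumptions made for this example; it is not the CSVLoader implementation:

```python
# Simplified model (an assumption, not the CSVLoader source): labels are
# whatever distinct strings show up in the column, in order of appearance.
def labels_from_column(values, missing="?"):
    """Collect distinct labels, skipping cells marked as missing."""
    labels = []
    for value in values:
        if value != missing and value not in labels:
            labels.append(value)
    return labels

train_column = ["yes", "no", "yes"]
other_column = ["no", "yes", "no"]
unlabeled_column = ["?", "?", "?"]

train_labels = labels_from_column(train_column)
other_labels = labels_from_column(other_column)
unlabeled_labels = labels_from_column(unlabeled_column)
```

The first two columns yield the same labels in a different order, and the all-missing column yields no labels at all, so none of the three attributes would be compatible with the others.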

What to do? Use ARFF files instead of CSV files, as they have a header which defines the attributes (and the labels in case of nominal ones). By storing the header of the training set on disk, you can then re-use that to create your test set with the correct structure.
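
One way to get such a file is to write the ARFF header yourself and append the CSV rows as data. The following is a minimal sketch, assuming all features are numeric, the class is the last column, and no cell contains a quoted comma; `csv_to_arff`, the relation name, and the sample data are hypothetical:

```python
def csv_to_arff(csv_text, relation, class_labels):
    """Convert CSV text to ARFF text with an explicitly declared class.

    Assumes every column except the last is numeric and the last column
    is the nominal class. "?" cells are kept as ARFF missing values.
    """
    rows = [line.split(",") for line in csv_text.strip().splitlines()]
    header, body = rows[0], rows[1:]

    lines = ["@relation " + relation, ""]
    for name in header[:-1]:
        lines.append("@attribute " + name + " numeric")
    # Declare the class labels in a fixed order, independent of the data.
    lines.append("@attribute " + header[-1] + " {" + ",".join(class_labels) + "}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in body]
    return "\n".join(lines) + "\n"

# Hypothetical unlabeled data: the class column contains only "?".
unlabeled = "f1,f2,cls\n1.0,2.0,?\n3.5,0.1,?\n"
arff = csv_to_arff(unlabeled, "demo", ["yes", "no"])
```

The resulting text still declares `{yes,no}` for the class even though no row carries a label. Passing the same `class_labels` list for the training and the prediction file keeps the two structures aligned.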

Quick aside answer: Yes, dist contains the class probabilities, aligned with the class labels.
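
To read a distribution by label rather than by position, you can pair it with the class labels in header order. A small sketch in plain Python, where the labels and probabilities are made-up values and `label_probabilities` is a hypothetical helper:

```python
def label_probabilities(labels, dist):
    """Pair each class label with its probability from a distribution."""
    if len(labels) != len(dist):
        raise ValueError("distribution length does not match number of labels")
    return dict(zip(labels, dist))

labels = ["yes", "no"]   # hypothetical class labels, in header order
dist = [0.8, 0.2]        # hypothetical distribution for one instance

probs = label_probabilities(labels, dist)
```

With these values, `probs["yes"]` is 0.8 and `probs["no"]` is 0.2.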