J48 decision tree in MATLAB using the Weka API

I am using the Weka API from MATLAB and have run into a problem.

I want to add a value to a nominal attribute in the header of my "Test" data set:

@relation test.txt-weka.filters.unsupervised.attribute.NumericToNominal-Rlast

@attribute att_1 numeric
@attribute att_2 numeric
@attribute att_3 {0,1}

@data

I want to add the value '2' to att_3, so that it reads: @attribute att_3 {0,1,2}

I tried these commands:

test.attribute(2).addStringValue(2) or test.attribute(2).addStringValue('2')

Both failed. Can anyone help?

My code is below. It is based on the question "How to retrieve class values from WEKA using MATLAB".

%# Set paths
WEKA_HOME = 'C:\Program Files\Weka-3-8';
javaaddpath([WEKA_HOME '\weka.jar']);
import weka.core.converters.MatlabLoader;

%# load dataset 
load mydata
X = feas;
Y = grp2idx(species);

%# 10-fold crossvalidation
k=10;
cvFolds = crossvalind('Kfold', species, k);   %# get indices of 10-fold CV
cp = classperf(species);   

for i = 1:k  
testIdx = (cvFolds == i); 
trainIdx = ~testIdx;   
xtrain = feas(trainIdx,:);
ytrain = species(trainIdx);
xtest = feas(testIdx,:);
ytest = species(testIdx);

train = [xtrain ytrain];
test =  [xtest ytest];
save train.txt train -ascii
save test.txt test -ascii

fName = 'train.txt';
loader = weka.core.converters.MatlabLoader();
loader.setFile( java.io.File(fName) );
train = loader.getDataSet();
train.setClassIndex( train.numAttributes()-1 );

fName = 'test.txt';
loader = weka.core.converters.MatlabLoader();
loader.setFile( java.io.File(fName) );
test = loader.getDataSet();
test.setClassIndex( test.numAttributes()-1 );

%# convert last attribute (class) from numeric to nominal
filter = weka.filters.unsupervised.attribute.NumericToNominal();
filter.setOptions( weka.core.Utils.splitOptions('-R last') );
filter.setInputFormat(train); 
train = filter.useFilter(train, filter);

filter = weka.filters.unsupervised.attribute.NumericToNominal();
filter.setOptions( weka.core.Utils.splitOptions('-R last') );
filter.setInputFormat(test);   
test = filter.useFilter(test, filter);

%# train J48 tree
classifier = weka.classifiers.trees.J48();
classifier.setOptions( weka.core.Utils.splitOptions('-O -B -J -A -S -M 1') );
classifier.buildClassifier( train );

%# classify test instances
numInst = test.numInstances();
pred = zeros(numInst,1);
predProbs = zeros(numInst, train.numClasses());

for i=1:numInst
pred(i) = classifier.classifyInstance( test.instance(i-1) );
end

for i=1:numInst
predProbs(i,:) = classifier.distributionForInstance( test.instance(i-1) );
end

eval = weka.classifiers.Evaluation(train);
eval.evaluateModel(classifier, test, javaArray('java.lang.Object',1));

disp( char(eval.toSummaryString()) )

end

My data set contains 31 instances, each with two attributes and one class.

The class takes one of three values: '0', '1' or '2'. Only 4 instances are in class '2'; the other 27 are in class '0' or '1'.

With 10-fold cross-validation, only a few folds (about 4) run without error.

The remaining folds (about 6) all fail with the error message "Subscripted assignment dimension mismatch." while running

for i=1:numInst
predProbs(i,:) = classifier.distributionForInstance( test.instance(i-1) );
end

At first I didn't know why. Then I found that the test data of those 6 folds contain no instances of class '2'. Their test data only contain classes '0' and '1', while their training data contain classes '0', '1' and '2'.

The folds that run successfully do contain class '2' in their test data.

For the folds whose test data lack class '2', the test set looks like this:

@relation test.txt-weka.filters.unsupervised.attribute.NumericToNominal-Rlast

@attribute att_1 numeric
@attribute att_2 numeric
@attribute att_3 {0,1}

@data
864.86315,40.15,0
1396.0296,36.263158,0
249.6065,71.5,1

So I'm wondering whether I should add '2' to @attribute att_3 {0,1} to solve the problem, or whether the problem lies somewhere else.

Solution

I think you need to apply weka.filters.unsupervised.attribute.AddValues with the options -C last -L 0,1,2 to your test data, after converting the class attribute to nominal but before using the test data for prediction. This ensures that the class attribute in the test data matches the one in the training data even if the test data set doesn't contain any instances with a given class value. The mismatch apparently arises because predProbs is preallocated with train.numClasses() columns (three), while the header of a failing test set declares only two class labels.
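As a rough sketch of that filter step in MATLAB: the snippet below assumes weka.jar is already on the Java class path (as set up with javaaddpath in the question) and, purely for illustration, rebuilds a failing test header from an ARFF string instead of loading test.txt. The variable name addVals is my own.

```matlab
%# Assumption: weka.jar is already on the Java class path (see javaaddpath).
%# Illustrative stand-in for a test fold whose class attribute lacks '2'.
arff = sprintf([ ...
    '@relation test\n' ...
    '@attribute att_1 numeric\n' ...
    '@attribute att_2 numeric\n' ...
    '@attribute att_3 {0,1}\n' ...
    '@data\n' ...
    '864.86315,40.15,0\n' ...
    '249.6065,71.5,1\n']);
test = weka.core.Instances(java.io.StringReader(arff));
test.setClassIndex( test.numAttributes()-1 );

%# add any of the labels 0,1,2 that the last attribute does not have yet
addVals = weka.filters.unsupervised.attribute.AddValues();
addVals.setOptions( weka.core.Utils.splitOptions('-C last -L 0,1,2') );
addVals.setInputFormat(test);
test = weka.filters.Filter.useFilter(test, addVals);
test.setClassIndex( test.numAttributes()-1 );   %# re-set, to be safe

%# inspect the class attribute and the number of classes after filtering
disp( char(test.attribute(test.classIndex()).toString()) )
disp( test.numClasses() )
```

In the question's loop, only the AddValues lines would be needed, placed right after the NumericToNominal step for test. Missing labels are appended at the end, so if a fold's test data lacked '0' or '1' instead of '2', the label order could differ from the training data; AddValues also has a -S option for sorting the labels, which may help in that case.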

I assume there is a reason you partition the data for cross-validation in MATLAB before building the models in Weka, rather than letting Weka run the cross-validation itself?
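For comparison, here is a minimal sketch of letting Weka handle the folds. It uses synthetic data of my own (30 instances, three classes) and J48 with default options rather than the options from the question, and it again assumes weka.jar is on the Java class path. The empty javaArray argument mirrors the one passed to evaluateModel in the question.

```matlab
%# Assumption: weka.jar is already on the Java class path (see javaaddpath).
%# Synthetic stand-in data: two numeric attributes and a class in {0,1,2}.
rng(0);
X = [randn(12,2); randn(12,2)+3; randn(6,2)+6];
y = [zeros(12,1); ones(12,1); 2*ones(6,1)];

%# build one ARFF string so that every fold shares the same header
header = sprintf([ ...
    '@relation demo\n' ...
    '@attribute att_1 numeric\n' ...
    '@attribute att_2 numeric\n' ...
    '@attribute att_3 {0,1,2}\n' ...
    '@data\n']);
rows = sprintf('%g,%g,%d\n', [X y]');
data = weka.core.Instances(java.io.StringReader([header rows]));
data.setClassIndex( data.numAttributes()-1 );

%# 10-fold cross-validation done entirely inside Weka
classifier = weka.classifiers.trees.J48();
evaluation = weka.classifiers.Evaluation(data);
evaluation.crossValidateModel(classifier, data, 10, java.util.Random(1), ...
    javaArray('java.lang.Object',1));

disp( char(evaluation.toSummaryString()) )
```

Because all folds are cut from a single Instances object, the class attribute keeps all three labels in every fold, so the header mismatch should not come up. If per-fold access is still needed, data.trainCV(k, fold) and data.testCV(k, fold) (with a zero-based fold index) return splits that share that header.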