java machine-learning regression weka libsvm

Java, Weka LibSVM does not predict correctly

I'm using LibSVM with Weka in my Java code to do regression. Below is my code:

public static void predict() {

    try {
        DataSource sourcePref1 = new DataSource("train_pref2new.arff");
        Instances trainData = sourcePref1.getDataSet();

        DataSource sourcePref2 = new DataSource("testDatanew.arff");
        Instances testData = sourcePref2.getDataSet();

        if (trainData.classIndex() == -1) {
            trainData.setClassIndex(trainData.numAttributes() - 2);
        }

        if (testData.classIndex() == -1) {
            testData.setClassIndex(testData.numAttributes() - 2);
        }

        LibSVM svm1 = new LibSVM();

        String options = ("-S 3 -K 2 -D 3 -G 1000.0 -R 0.0 -N 0.5 -M 40.0 -C 1.0 -E 0.001 -P 0.1");
        String[] optionsArray = options.split(" ");
        svm1.setOptions(optionsArray);

        svm1.buildClassifier(trainData);

        for (int i = 0; i < testData.numInstances(); i++) {

            double pref1 = svm1.classifyInstance(testData.instance(i));                
            System.out.println("predicted value : " + pref1);

        }

    } catch (Exception ex) {
        Logger.getLogger(Test.class.getName()).log(Level.SEVERE, null, ex);
    }
}

However, the value predicted by this code differs from the value predicted by the Weka GUI.

Example: I gave the same single test instance to both the Java code and the Weka GUI.

The Java code predicted the value as 1.9064516129032265 while the Weka GUI's predicted value is 10.043. I am using the same training data set and the same parameters for both Java code and Weka GUI.

I hope you understand my question. Could anyone tell me what's wrong with my code?

Solution

You are using the wrong algorithm to perform SVM regression. LibSVM is used for classification. The one you want is SMOreg, which is a specific SVM for regression.

Below is a complete example that shows how to use SMOreg using both the Weka Explorer GUI as well as the Java API. For data, I will use the cpu.arff data file that comes with the Weka distribution. Note that I'll use this file for both training and test, but ideally you would have separate data sets.

Using the Weka Explorer GUI

Open the WEKA Explorer GUI, click on the Preprocess tab, click on Open File, and then open the cpu.arff file that should be in your Weka distribution. On my system, the file is under weka-3-8-1/data/cpu.arff. The Explorer window should look like the following:

Click on the Classify tab. It should really be called "Prediction" because you can do both classification and regression here. Under Classifier, click on Choose and then select weka --> classifiers --> functions --> SMOreg, as shown below.

Now build the regression model and evaluate it. Under Test Options, choose Use training set so that the training set is used for testing as well (as I mentioned above, this is not the ideal methodology). Now press Start, and the result should look like the following:

Make a note of the RMSE value (74.5996). We'll revisit that in the Java code implementation.

Using the Java API

Below is a complete Java program that uses the Weka API to replicate the results shown earlier in the Weka Explorer GUI.

import weka.classifiers.functions.SMOreg;
import weka.classifiers.Evaluation;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Tester {

    /**
     * Builds a regression model using SMOreg, the SVM for regression, and 
     * evaluates it with the Evaluation framework.
     */
    public void buildAndEvaluate(String trainingArff, String testArff) throws Exception {

        System.out.printf("buildAndEvaluate() called.\n");

        // Load the training and test instances.
        Instances trainingInstances = DataSource.read(trainingArff);
        Instances testInstances = DataSource.read(testArff);

        // Set the true value to be the last field in each instance.
        trainingInstances.setClassIndex(trainingInstances.numAttributes()-1);
        testInstances.setClassIndex(testInstances.numAttributes()-1);

        // Build the SMOregression model.
        SMOreg smo = new SMOreg();
        smo.buildClassifier(trainingInstances);

        // Use Weka's evaluation framework.
        Evaluation eval = new Evaluation(trainingInstances);
        eval.evaluateModel(smo, testInstances);

        // Print the options that were used in the ML algorithm.
        String[] options = smo.getOptions();
        System.out.printf("Options used:\n");
        for (String option : options) {
            System.out.printf("%s ", option);
        }
        System.out.printf("\n\n");

        // Print the algorithm details.
        System.out.printf("Algorithm:\n %s\n", smo.toString());

        // Print the evaluation results.
        System.out.printf("%s\n", eval.toSummaryString("\nResults\n=====\n", false));
    }

    /**
     * Builds a regression model using SMOreg, the SVM for regression, and 
     * tests each data instance individually to compute RMSE.
     */
    public void buildAndTestEachInstance(String trainingArff, String testArff) throws Exception {

        System.out.printf("buildAndTestEachInstance() called.\n");

        // Load the training and test instances.
        Instances trainingInstances = DataSource.read(trainingArff);
        Instances testInstances = DataSource.read(testArff);

        // Set the true value to be the last field in each instance.
        trainingInstances.setClassIndex(trainingInstances.numAttributes()-1);
        testInstances.setClassIndex(testInstances.numAttributes()-1);

        // Build the SMOregression model.
        SMOreg smo = new SMOreg();
        smo.buildClassifier(trainingInstances);

        int numTestInstances = testInstances.numInstances();

        // This variable accumulates the squared error from each test instance.
        double sumOfSquaredError = 0.0;

        // Loop over each test instance.
        for (int i = 0; i < numTestInstances; i++) {

            Instance instance = testInstances.instance(i);

            double trueValue = instance.value(testInstances.classIndex());
            double predictedValue = smo.classifyInstance(instance);

            // Uncomment the next line to see every prediction on the test instances.
            //System.out.printf("true=%10.5f, predicted=%10.5f\n", trueValue, predictedValue);

            double error = trueValue - predictedValue;
            sumOfSquaredError += (error * error);
        }

        // Print the RMSE results.
        double rmse = Math.sqrt(sumOfSquaredError / numTestInstances);
        System.out.printf("RMSE = %10.5f\n", rmse);
    }

    public static void main(String argv[]) throws Exception {

        Tester classify = new Tester();
        classify.buildAndEvaluate("../weka-3-8-1/data/cpu.arff", "../weka-3-8-1/data/cpu.arff");
        classify.buildAndTestEachInstance("../weka-3-8-1/data/cpu.arff", "../weka-3-8-1/data/cpu.arff");
    }
}

I've written two functions that train an SMOreg model and evaluate the model by running prediction on the training data.

buildAndEvaluate() evaluates the model by using the Weka Evaluation framework to run a suite of tests to get the exact same results as the Explorer GUI. Notably, it produces an RMSE value.
buildAndTestEachInstance() evaluates the model by explicitly looping over each test instance, making a prediction, computing the error, and computing an overall RMSE. Note that this RMSE matches the one from buildAndEvaluate(), which in turn matches the one from the Explorer GUI.
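
For reference, the RMSE arithmetic inside buildAndTestEachInstance() can be sketched without any Weka classes. This is only a minimal illustration and not part of the program above: the class name RmseSketch, the method rmse(), and the sample numbers are all made up for this example and do not come from cpu.arff.

```java
import java.util.Locale;

// Minimal sketch: RMSE over plain arrays, no Weka required.
// RmseSketch, rmse(), and the sample values are hypothetical names and data.
final class RmseSketch {

    // Returns sqrt(mean((true - predicted)^2)) over the paired values.
    static double rmse(double[] trueValues, double[] predictedValues) {
        if (trueValues.length == 0 || trueValues.length != predictedValues.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        // Accumulate the squared error, as the loop in buildAndTestEachInstance() does.
        double sumOfSquaredError = 0.0;
        for (int i = 0; i < trueValues.length; i++) {
            double error = trueValues[i] - predictedValues[i];
            sumOfSquaredError += error * error;
        }
        return Math.sqrt(sumOfSquaredError / trueValues.length);
    }

    public static void main(String[] args) {
        // Hypothetical true and predicted values, chosen only for illustration.
        double[] trueValues = {10.0, 20.0, 30.0, 40.0};
        double[] predictedValues = {12.0, 18.0, 33.0, 39.0};
        System.out.printf(Locale.ROOT, "RMSE = %10.5f%n", rmse(trueValues, predictedValues));
    }
}
```

In the real program, the true value comes from instance.value(testInstances.classIndex()) and the predicted value from smo.classifyInstance(instance); the arithmetic is the same.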

Below is the result from compiling and running the program.

prompt> javac -cp weka.jar Tester.java

prompt> java -cp .:weka.jar Tester

buildAndEvaluate() called.
Options used:
-C 1.0 -N 0 -I weka.classifiers.functions.supportVector.RegSMOImproved -T 0.001 -V -P 1.0E-12 -L 0.001 -W 1 -K weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007 

Algorithm:
 SMOreg

weights (not support vectors):
 +       0.01   * (normalized) MYCT
 +       0.4321 * (normalized) MMIN
 +       0.1847 * (normalized) MMAX
 +       0.1175 * (normalized) CACH
 +       0.0973 * (normalized) CHMIN
 +       0.0235 * (normalized) CHMAX
 -       0.0168



Number of kernel evaluations: 21945 (93.081% cached)

Results
=====

Correlation coefficient                  0.9044
Mean absolute error                     31.7392
Root mean squared error                 74.5996
Relative absolute error                 33.0908 %
Root relative squared error             46.4953 %
Total Number of Instances              209     

buildAndTestEachInstance() called.
RMSE =   74.59964
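
The two RMSE figures in this output (74.5996 from the Evaluation summary and 74.59964 from the manual loop) differ only in how many decimal places are printed. As a small sanity check, one could compare them at the summary's four-decimal precision. The sketch below assumes that four decimals is the right precision to compare at; the class name RmseRoundingCheck and the helper toFourDecimals() are hypothetical.

```java
import java.util.Locale;

// Minimal sketch: compare the two printed RMSE values at four decimal places.
// RmseRoundingCheck and toFourDecimals() are hypothetical names for this example.
final class RmseRoundingCheck {

    // Formats a value with four decimal places, the precision shown in the summary.
    static String toFourDecimals(double value) {
        return String.format(Locale.ROOT, "%.4f", value);
    }

    public static void main(String[] args) {
        double rmseFromSummary = 74.5996;  // "Root mean squared error" line in the output
        double rmseFromLoop = 74.59964;    // "RMSE =" line in the output

        boolean same = toFourDecimals(rmseFromLoop).equals(toFourDecimals(rmseFromSummary));
        System.out.println("RMSE values agree at four decimals: " + same);
    }
}
```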