Tags: c++, opencv, machine-learning, naivebayes, opencv3.1

Why is OpenCV3.1 NormalBayesClassifier's error rate so high in this example?


I'm trying to use OpenCV3.1's NormalBayesClassifier on a simple problem that I can easily generate training data for. I settled on classifying input numbers as even or odd. Obviously this can be computed directly with 100% accuracy, but the point is to exercise the ML capabilities of OpenCV in order to get familiar with it.

So, my first question is - is there a theoretical reason why NormalBayesClassifier wouldn't be an appropriate model for this problem?

If not, the second question is, why is my error rate so high? cv::ml::StatModel::calcError() is giving me outputs of 30% - 70%.

Third, what's the best way to bring the error rate down?

Here's a minimal, self-contained snippet that demonstrates the issue:

(To be clear, the classification/output should be 0 for an even number and 1 for an odd number).

#include <opencv2/ml.hpp>
#include <iostream>
#include <iomanip>
#include <ctime>

int main() {

   const int numSamples = 1000;
   cv::RNG rng((uint64) time(NULL));   // seed the RNG from the current time

   // construct training sample data

   cv::Mat samples;
   samples.create(numSamples, 1, CV_32FC1);

   for (int i = 0; i < numSamples; i++) {
      samples.at<float>(i) = (int)rng(10000);
   }

   // construct training response data

   cv::Mat responses;
   responses.create(numSamples, 1, CV_32SC1);

   for (int i = 0; i < numSamples; i++) {
      int sample = (int) samples.at<float>(i);
      int response = (sample % 2);
      responses.at<int>(i) = response;
   }

   cv::Ptr<cv::ml::TrainData> data = cv::ml::TrainData::create(samples, cv::ml::ROW_SAMPLE, responses);

   data->setTrainTestSplitRatio(.9);

   cv::Ptr<cv::ml::NormalBayesClassifier> classifier = cv::ml::NormalBayesClassifier::create();

   classifier->train(data);

   float errorRate = classifier->calcError(data, true, cv::noArray());

   std::cout << "Bayes error rate: [" << errorRate << "]" << std::endl;

   // construct prediction inputs
   const int numPredictions = 10;

   cv::Mat predictInputs;
   predictInputs.create(numPredictions, 1, CV_32FC1);

   for (int i = 0; i < numPredictions; i++) {
      predictInputs.at<float>(i) = (int)rng(10000);
   }

   cv::Mat predictOutputs;
   predictOutputs.create(numPredictions, 1, CV_32SC1);

   // run prediction

   classifier->predict(predictInputs, predictOutputs);

   int numCorrect = 0;

   for (int i = 0; i < numPredictions; i++) {
      int input = (int)predictInputs.at<float>(i);
      int output = predictOutputs.at<int>(i);
      bool correct = (input % 2 == output);

      if (correct)
         numCorrect++;

      std::cout << "Input = [" << (int)predictInputs.at<float>(i) << "], " << "predicted output = [" << predictOutputs.at<int>(i) << "], " << "correct = [" << (correct ? "yes" : "no") << "]"  << std::endl;
   }

   float percentCorrect = (float)numCorrect / numPredictions * 100.0f;

   std::cout << "Percent correct = [" << std::fixed << std::setprecision(0) << percentCorrect << "]" << std::endl;
}

Sample run output:

Bayes error rate: [36]
Input = [9150], predicted output = [1], correct = [no]
Input = [3829], predicted output = [0], correct = [no]
Input = [4985], predicted output = [0], correct = [no]
Input = [8113], predicted output = [1], correct = [yes]
Input = [7175], predicted output = [0], correct = [no]
Input = [811], predicted output = [1], correct = [yes]
Input = [699], predicted output = [1], correct = [yes]
Input = [7955], predicted output = [1], correct = [yes]
Input = [8282], predicted output = [1], correct = [no]
Input = [1818], predicted output = [0], correct = [yes]
Percent correct = [50]

Solution

  • In your code you give the algorithm a single feature: the number to classify. That is not enough unless you provide several examples of the same numbers, multiple times. If you want the learning algorithm to learn something about odd vs. even, you need to think about what kind of features the classifier could use to learn that. Most machine learning techniques require careful feature engineering on your part first.

    Since you want to experiment with ML, I suggest the following:

    1. For each number, create say 5 features, one encoding each digit. Thus 5 would be 00005, i.e. f1=0, f2=0, f3=0, f4=0, f5=5, and 11098 would be f1=1, f2=1, f3=0, f4=9, f5=8.
    2. If you have numbers larger than that, you can keep only the last 5 digits.
    3. Train your classifier
    4. Test with the same encoding. What you'd like your classifier to learn is that only the last digit matters for deciding odd vs. even (see the sketch below).
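
    A rough, hedged sketch of steps 1–3 (untested, and using only the same cv::ml calls that already appear in the question's code), where each number is expanded into five digit features before training:

    #include <opencv2/ml.hpp>
    #include <iostream>
    #include <ctime>

    int main() {
       const int numSamples = 1000;
       const int numFeatures = 5;               // one feature per decimal digit
       cv::RNG rng((uint64) time(NULL));

       cv::Mat samples(numSamples, numFeatures, CV_32FC1);
       cv::Mat responses(numSamples, 1, CV_32SC1);

       for (int i = 0; i < numSamples; i++) {
          int value = (int) rng(10000);
          responses.at<int>(i) = value % 2;     // 0 = even, 1 = odd

          // f1..f5 = the five decimal digits, most significant first
          int v = value;
          for (int d = numFeatures - 1; d >= 0; d--) {
             samples.at<float>(i, d) = (float)(v % 10);
             v /= 10;
          }
       }

       cv::Ptr<cv::ml::TrainData> data =
             cv::ml::TrainData::create(samples, cv::ml::ROW_SAMPLE, responses);
       data->setTrainTestSplitRatio(.9);

       cv::Ptr<cv::ml::NormalBayesClassifier> classifier =
             cv::ml::NormalBayesClassifier::create();
       classifier->train(data);

       std::cout << "Error rate with digit features: ["
                 << classifier->calcError(data, true, cv::noArray()) << "]" << std::endl;
    }

    The prediction part of the original program would then need the same five-digit encoding applied to each input before calling predict().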

    If you want to play with it more, you could encode the numbers in binary format, which would make it even easier for the classifier to learn what makes a number odd or even.
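
    A similarly hedged sketch of that binary encoding (again untested; only the feature construction changes relative to the digit version above). Since the values are below 10000, 14 bits are enough:

    #include <opencv2/ml.hpp>
    #include <iostream>
    #include <ctime>

    int main() {
       const int numSamples = 1000;
       const int numBits = 14;                  // enough bits for values < 10000
       cv::RNG rng((uint64) time(NULL));

       cv::Mat samples(numSamples, numBits, CV_32FC1);
       cv::Mat responses(numSamples, 1, CV_32SC1);

       for (int i = 0; i < numSamples; i++) {
          int value = (int) rng(10000);
          responses.at<int>(i) = value % 2;     // 0 = even, 1 = odd
          for (int b = 0; b < numBits; b++)     // one feature per bit, least significant first
             samples.at<float>(i, b) = (float)((value >> b) & 1);
       }

       cv::Ptr<cv::ml::TrainData> data =
             cv::ml::TrainData::create(samples, cv::ml::ROW_SAMPLE, responses);
       data->setTrainTestSplitRatio(.9);

       cv::Ptr<cv::ml::NormalBayesClassifier> classifier =
             cv::ml::NormalBayesClassifier::create();
       classifier->train(data);

       std::cout << "Error rate with binary features: ["
                 << classifier->calcError(data, true, cv::noArray()) << "]" << std::endl;
    }

    One caveat: NormalBayesClassifier assumes normally distributed features per class, and with this encoding the lowest bit is constant within each class, so the result depends on how the implementation handles that degenerate case.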