I'm working on an email classification project that will classify emails into a certain category. So far, we save interesting data (e.g. subject and body) along with other information in our database. We have successfully applied Term Frequency - Inverse Document Frequency (TF-IDF) to the project to retrieve a matrix of all the terms/features found within the subjects and bodies of our emails. A very small sample of that matrix would be:
        dog    cat    fish
doc1    0.024  0.011  0.008
doc2    0.011  0.014  0.007
doc3    0.005  0.024  0.003
doc4    0.008  0.028  0.008
doc5    0.002  0.030  0.006
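For anyone unfamiliar with the weighting itself, the TF-IDF values above can be sketched in a few lines of Python. This is only a toy illustration with made-up documents, not our actual pipeline (there are several TF-IDF variants; this one uses tf = count / document length and idf = log(N / document frequency)):

```python
import math

def tfidf(docs):
    """Compute a tf-idf matrix for a list of tokenized documents.

    tf  = term count / document length
    idf = log(number of documents / number of documents containing the term)
    """
    vocab = sorted({term for doc in docs for term in doc})
    n = len(docs)
    df = {t: sum(1 for doc in docs if t in doc) for t in vocab}
    matrix = []
    for doc in docs:
        row = []
        for t in vocab:
            tf = doc.count(t) / len(doc)
            idf = math.log(n / df[t])
            row.append(tf * idf)
        matrix.append(row)
    return vocab, matrix

# Toy corpus standing in for tokenized email subjects/bodies
docs = [["dog", "dog", "cat"], ["cat", "fish"], ["dog", "fish", "fish"]]
vocab, m = tfidf(docs)
```

Each row of `m` then corresponds to one document, each column to one term, just like the matrix shown above.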
In reality this matrix is much larger: we have approximately 23,000 terms for a set of 165 emails. Because we need to classify emails using the terms in this matrix, 23,000 features are simply too many. That's why we've implemented a dimensionality reduction algorithm (PCA), using this code (Accord framework):
// Creates the Principal Component Analysis of the given source
pca = new PrincipalComponentAnalysis(matrix, AnalysisMethod.Center);
// Compute the Principal Component Analysis
pca.Compute();
// Creates a projection of the information
double[,] components = pca.Transform(matrix, 20);
// Creates form to show components
frmRPCA frmPCA = new frmRPCA(components);
frmPCA.ShowDialog();
Right now we've hardcoded the number of dimensions, but that shouldn't be an issue for the time being.
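Conceptually, what `pca.Transform(matrix, 20)` does is: center each feature column, find the eigenvectors of the covariance matrix, and project the data onto the top components. A minimal NumPy sketch of that idea (an illustration, not Accord's actual implementation):

```python
import numpy as np

def pca_transform(X, n_components):
    """Project X onto its top principal components (centering only,
    matching AnalysisMethod.Center in the Accord call above)."""
    Xc = X - X.mean(axis=0)                 # center each feature column
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # sort descending by variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                  # project onto top components

# Toy 6x2 data matrix standing in for the documents-by-terms matrix
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9],
              [1.9, 2.2], [3.1, 3.0], [2.3, 2.7]])
reduced = pca_transform(X, 1)   # 6x2 -> 6x1
```

In the question's setting the same operation takes the 165x23000 matrix down to 165x20.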
I've been looking at the Accord framework's example of how to classify using Naive Bayes, but I can't really figure out how to put it into practice, mainly because the example uses text while we're working with numbers, and I don't fully understand how the classification works. See the example on how to implement Naive Bayes.
Basically, I have my original matrix containing my features and their TF-IDF values (see the sample above), and I want to classify my emails using the matrix of principal components (the output of the pca.Transform method). At the moment I only have two classes in which I want to classify my emails (Registration and Submission). How would I achieve this? Also, how would I expand this in case I want to add more classes in the future?
Example output should be something like:
doc1 Registration
doc2 Registration
doc3 Registration
doc4 Submission
doc5 Submission
If you are interested in doing classification, then perhaps LDA (and its variants) would be better suited for your case. The fact is that PCA looks solely at your data and finds the directions of maximum variance, ignoring any class information. However, if you have extra information about your data, such as class labels, there are better ways to achieve what you need.
If you have extra information in the form of class labels (that is, each sample in your dataset has an associated integer value indicating which class it belongs to), then you can use LDA (Linear Discriminant Analysis) to reduce the dimensionality in a way that is useful for classification.
If you have extra information in the form of real-valued outputs (that is, each sample in your dataset has an associated double value), then you can use PLS (Partial Least Squares) to reduce the dimensionality in a way that is useful for regression.
Assuming that you have a classification problem, here is an example of how to reduce the dimensionality of your data using LDA:
// Create some sample input data instances. This is the same
// data used in the Gutierrez-Osuna's example available at:
// http://research.cs.tamu.edu/prism/lectures/pr/pr_l10.pdf
double[][] inputs =
{
    // Class 0
    new double[] { 4, 1 },
    new double[] { 2, 4 },
    new double[] { 2, 3 },
    new double[] { 3, 6 },
    new double[] { 4, 4 },
    // Class 1
    new double[] { 9, 10 },
    new double[] { 6, 8 },
    new double[] { 9, 5 },
    new double[] { 8, 7 },
    new double[] { 10, 8 }
};
int[] output =
{
    0, 0, 0, 0, 0, // The first five are from class 0
    1, 1, 1, 1, 1  // The last five are from class 1
};
// Then, we will create an LDA for the given instances.
var lda = new LinearDiscriminantAnalysis(inputs, output);
lda.Compute(); // Compute the analysis
// Now we can project the data into LDA space:
double[][] projection = lda.Transform(inputs);
If you would like to reduce the problem from 2 dimensions to 1 dimension, you could use:
double[][] reduced_data = lda.Transform(inputs, 1);
The result would be a 10x1 matrix containing a lower-dimensional representation of the data that is still useful for classification. So instead of using your original data to train your classifiers, you can use reduced_data instead.
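For intuition, the two-class case of LDA can be sketched directly: compute the class means and the within-class scatter matrix Sw, then project onto the Fisher direction w = Sw⁻¹(m₁ − m₀). Here is a NumPy sketch of that criterion using the same Gutierrez-Osuna sample data as above (an illustration of the idea, not Accord's exact implementation):

```python
import numpy as np

X0 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], float)    # class 0
X1 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], float)  # class 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)   # per-class mean vectors
# Within-class scatter: sum of the per-class scatter matrices
Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
# Fisher discriminant direction for the two-class case
w = np.linalg.solve(Sw, m1 - m0)
# One-dimensional projection of all ten samples
projection = np.concatenate([X0, X1]) @ w
```

On this data every class-1 projection ends up larger than every class-0 projection, which is exactly what makes the 10x1 reduced representation easy to classify.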
Also, the LDA object comes with a simple minimum distance classifier that you can use to classify your instances. For example, you can classify your dataset using
int[] results = lda.Classify(inputs);
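The minimum distance rule is simple enough to sketch by hand: assign each sample to the class whose mean, in the projected space, is closest. A toy Python sketch of that rule on one-dimensional projections (illustrative values, not the Accord implementation):

```python
def min_distance_classify(projected, labels, x):
    """Assign x to the class whose mean projected value is nearest."""
    classes = sorted(set(labels))
    means = {c: sum(p for p, l in zip(projected, labels) if l == c)
                / labels.count(c)
             for c in classes}
    return min(classes, key=lambda c: abs(x - means[c]))

# Toy one-dimensional projections (e.g. an LDA output) for two classes
projected = [1.9, 1.6, 1.4, 2.4, 2.5, 5.8, 4.1, 4.9, 4.8, 5.9]
labels    = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

A value near the first cluster is assigned class 0, one near the second cluster class 1; `lda.Classify` applies the same idea in the LDA-projected space.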
However, nothing prevents you from using any other classifier that you might like (such as Naive Bayes). For example, in order to use Naive Bayes, you could use
// Create a new normal distribution Naive Bayes classifier for
// a classification problem with 1 feature and the two classes
var nb = new NaiveBayes.Normal(classes: 2, inputs: 1);
// Compute the Naive Bayes model
nb.Estimate(reduced_data, output);
// Now, if we would like to classify the first instance
// in our dataset, we would use
int result = nb.Compute(lda.Transform(inputs[0]));
There are also sample applications that come with the framework that demonstrate how LDA and Naive Bayes work.
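To demystify what a normal-distribution Naive Bayes does with numeric features like yours: per class, it fits a prior plus a mean and variance for each feature, then picks the class with the highest (log-)posterior. A minimal Python sketch of that idea (a conceptual illustration, not Accord's implementation):

```python
import math

def fit_gaussian_nb(X, y):
    """Per class: prior probability plus mean/variance of each feature."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(rows)
                     for col, m in zip(zip(*rows), means)]
        model[c] = (len(rows) / len(y), means, variances)
    return model

def predict(model, x):
    def log_posterior(c):
        prior, means, variances = model[c]
        # log prior + sum of per-feature Gaussian log-densities
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=log_posterior)

# One-dimensional training data, e.g. an LDA or PCA projection
X = [[1.9], [1.6], [1.4], [2.4], [2.5], [5.8], [4.1], [4.9], [4.8], [5.9]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
model = fit_gaussian_nb(X, y)
```

Adding more classes (the question's follow-up) is just a matter of adding more labels to `y`; both this sketch and the Accord classes handle any number of classes the same way.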