
Meaning of this statement in R (Naive Bayes Classifier)


I am going through the code our professor gave us for building a Naive Bayes classifier. Note that we are not using a built-in package; we are writing it ourselves for learning purposes.

One of the statements that the professor has used confuses me:

t = (Xtrain[,11] == c);

where Xtrain is the data set we use to construct the classifier. I think I understand what Xtrain[,11] == c does, but I don't get the assignment to t. Could someone please explain what it does and why?

Edit:

Following is the code that he is using to train the classifier:

X = read.csv("naive_bayes_binary.csv");
tnum = nrow(X)/2;  
Xtrain = X[1:tnum,];  # the data we construct the classifier from
p = matrix(0,3,10);  #  p[c,j] = P(x_j = 1 | Y = c)
prior = rep(0,3);  # will be prior probs
n = rep(0,3);  # will be class counts
for (c in 1:3) {
    t = (Xtrain[,11] == c);    ### What is this?
    n[c] = sum(t);
    for (j in 1:10) {
        p[c,j] = sum(Xtrain[t,j] == 1)/n[c]  
    # empirical prob that jth feat = 1 for cth class
    }
}
prior = n/tnum;  # the prior probabilities of the classes

Solution

  • As I mentioned in the comment, t is a vector of logicals indicating which values in Xtrain[,11] are equal to c. If you sum the vector t, you get the number of occurrences (since TRUE counts as 1 and FALSE as 0).

    Here's a small working example:

    ## 10 classes
    n <- rep(0,10)
    
    # class number of interest
    c <- 7
    
    # data vector (in OP's example a column)
    X11 <- sample(1:10, 100, replace = TRUE)
    
    X11
          [1]  2  7  5 10  4  5  1  7  4  4  1  8  1  5  7  1 10  2  6  9 10  4  3  2  2  8  7 10  3  2  5  3 10  4  8  2  2  8  6  2  5  2
         [43]  1  4  9  3  3  4  9  7  5 10 10  9  6 10  9  8  7  9  8  2  1  1  4  5  3 10  4  9 10  3 10  1  7 10  6  8  3  1  9  5  5  2
         [85]  9  9  1  9  3  3  3 10  5  3  3  2  7  4  3 10
    
    
    # vector of logicals
    t <- X11 == c
    
    t
      [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
     [22] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
     [43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
     [64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
     [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
    
    # assign number of occurrences
    n[c] <- sum(t)
    

    The output of n shows 8 occurrences:

    n
     [1] 0 0 0 0 0 0 8 0 0 0
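
As for why the logical vector is stored at all: the question's code reuses it as a row index in Xtrain[t,j], which keeps only the rows belonging to class c. The sketch below illustrates that pattern on a small made-up data frame. The names Xtoy, cls, mask, n_cls, rows and p_f1 are hypothetical and not part of the original code, and the data are invented purely for illustration.

```r
# Hypothetical toy data: two binary features plus a class label in column 3
Xtoy <- data.frame(
  f1    = c(1, 0, 1, 1, 0, 1),
  f2    = c(0, 0, 1, 1, 1, 0),
  label = c(1, 2, 1, 2, 1, 2)
)

cls  <- 1                  # class of interest (named cls to avoid masking base::c)
mask <- Xtoy[, 3] == cls   # logical vector with one entry per row

n_cls <- sum(mask)         # TRUE counts as 1, so this is the class count
rows  <- Xtoy[mask, ]      # logical indexing keeps only rows where mask is TRUE

# Empirical P(f1 = 1 | label = cls), the same pattern as p[c,j] in the question
p_f1 <- sum(Xtoy[mask, 1] == 1) / n_cls

print(mask)
print(n_cls)
print(p_f1)
```

Storing the comparison once means it can serve both purposes: sum(mask) gives the class count, and Xtoy[mask, ] selects the matching rows without recomputing the comparison for every feature.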