I am going through the code that our professor has provided us with for creating a Naive Bayes classifier. Note that we are not using a built-in package; we are writing it ourselves for learning purposes.
One of the statements that the professor has used confuses me:
t = (Xtrain[,11] == c);
where Xtrain is the data set we are using to construct the classifier. I think I understand what Xtrain[,11] == c does, but what I don't get is the assignment to t. Could someone please explain what it does and why?
Edit:
Following is the code that he is using to train the classifier:
X = read.csv("naive_bayes_binary.csv");
tnum = nrow(X)/2;
Xtrain = X[1:tnum,]; # the data we construct the classifier from
p = matrix(0,3,10); # p[c,j] = P(x_j = 1 | Y = c)
prior = rep(0,3); # will be prior probs
n = rep(0,3); # will be class counts
for (c in 1:3) {
  t = (Xtrain[,11] == c); ### What is this?
  n[c] = sum(t);
  for (j in 1:10) {
    p[c,j] = sum(Xtrain[t,j] == 1)/n[c]
    # empirical prob that jth feat = 1 for cth class
  }
}
prior = n/tnum; # the prior probabilities of the classes
As I mentioned in the comment, t is a vector of logicals indicating which values in Xtrain[,11] are equal to c. If you sum the vector t, you'll get the number of occurrences (as TRUE counts as 1 and FALSE as 0).
Here's a small working example:
## 10 classes
n <- rep(0,10)
# class number of interest
c <- 7
# data vector (in OP's example a column)
X11 <- sample(1:10,100,replace = TRUE)
X11
[1] 2 7 5 10 4 5 1 7 4 4 1 8 1 5 7 1 10 2 6 9 10 4 3 2 2 8 7 10 3 2 5 3 10 4 8 2 2 8 6 2 5 2
[43] 1 4 9 3 3 4 9 7 5 10 10 9 6 10 9 8 7 9 8 2 1 1 4 5 3 10 4 9 10 3 10 1 7 10 6 8 3 1 9 5 5 2
[85] 9 9 1 9 3 3 3 10 5 3 3 2 7 4 3 10
# vector of logicals
t <- X11 == c
t
[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[22] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[43] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
[64] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# assign number of occurrences
n[c] <- sum(t)
The output of n shows 8 occurrences:
n
[1] 0 0 0 0 0 0 8 0 0 0
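The professor's loop also uses t a second time, as a row index in Xtrain[t,j]. Here is a minimal sketch of that step. It assumes a made-up toy data frame with two binary features and the class label in column 3 (the original has ten features and the label in column 11); the names cls, n_cls, rows_cls and p_cls are only for illustration and are not in the original code.

```r
# Hypothetical toy training set: two binary features plus a class label.
# (Assumption: the real data has 10 features and the label in column 11.)
Xtrain <- data.frame(
  x1 = c(1, 0, 1, 1, 0, 1),
  x2 = c(0, 0, 1, 1, 1, 0),
  y  = c(1, 2, 1, 3, 2, 1)
)

cls <- 1                     # class of interest ('c' in the original loop)
t <- (Xtrain[, 3] == cls)    # TRUE for rows whose label equals cls
n_cls <- sum(t)              # TRUE counts as 1, so this is the class count

# Indexing rows with a logical vector keeps only the rows where t is TRUE
rows_cls <- Xtrain[t, ]

# Empirical P(x_j = 1 | Y = cls) for each feature, mirroring p[c,j]
p_cls <- sapply(1:2, function(j) sum(Xtrain[t, j] == 1) / n_cls)

print(t)
print(n_cls)
print(rows_cls)
print(p_cls)
```

With this toy data, t selects rows 1, 3 and 6, so n_cls is 3 and the two estimates are 1 and 1/3. So the assignment to t just stores the comparison result once so it can be reused, first for counting and then for subsetting.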