
Question about the Naive Bayes implementation in the R package e1071


Below is the training dataset I am using with the Naive Bayes implementation in R (the e1071 package). X, Y and Z are the classes, and V1, V2, V3, V4 and V5 are the attributes:

Class   V1  V2  V3  V4  V5
X       Yes Yes No  Yes Yes
X       Yes Yes No  No  Yes
X       Yes Yes No  No  Yes
X       Yes Yes No  No  Yes
X       No  Yes No  No  Yes
X       No  Yes No  No  Yes
X       No  Yes No  No  Yes
X       No  No  No  No  No
X       No  No  No  No  No
X       No  No  No  No  No
X       No  No  No  No  No
X       No  No  No  No  No
X       No  No  No  No  No
X       No  No  No  No  No
X       No  No  No  No  No
X       No  No  No  No  No
Y       Yes Yes Yes No  Yes
Y       No  No  No  No  Yes
Y       No  No  No  No  Yes
Y       No  No  No  No  No
Y       No  No  No  No  No
Y       No  No  No  No  No
Y       No  No  No  No  No
Z       No  Yes Yes No  Yes
Z       No  No  No  No  Yes
Z       No  No  No  No  Yes
Z       No  No  No  No  No
Z       No  No  No  No  No
Z       No  No  No  No  No
Z       No  No  No  No  No

The prior probabilities for the above dataset are X = 0.5333333, Y = 0.2333333 and Z = 0.2333333,

and the conditional probabilities are:

V1
Y          No       Yes
   X 0.7500000 0.2500000
   Y 0.8571429 0.1428571
   Z 1.0000000 0.0000000

V2
Y          No       Yes
   X 0.5625000 0.4375000
   Y 0.8571429 0.1428571
   Z 0.8571429 0.1428571

V3
 Y          No       Yes
   X 1.0000000 0.0000000
   Y 0.8571429 0.1428571
   Z 0.8571429 0.1428571

V4
 Y       No    Yes
   X 0.9375 0.0625
   Y 1.0000 0.0000
   Z 1.0000 0.0000

V5
 Y          No       Yes
   X 0.5625000 0.4375000
   Y 0.5714286 0.4285714
   Z 0.5714286 0.4285714
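
For readers who want to check these numbers, the following is a minimal base-R sketch that rebuilds the 30 training rows from the table above and recomputes the priors and the V3 table. The vector `n` and the way the rows are grouped are illustrative choices, not part of the original post; no e1071 call is needed for this step.

```r
# How many times each distinct row of the training table occurs (assumed grouping).
n <- c(1, 3, 3, 9, 1, 2, 4, 1, 2, 4)

train <- data.frame(
  Class = rep(c("X", "X", "X", "X", "Y", "Y", "Y", "Z", "Z", "Z"), n),
  V1 = rep(c("Yes", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No"), n),
  V2 = rep(c("Yes", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "No", "No"), n),
  V3 = rep(c("No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "No"), n),
  V4 = rep(c("Yes", "No", "No", "No", "No", "No", "No", "No", "No", "No"), n),
  V5 = rep(c("Yes", "Yes", "Yes", "No", "Yes", "Yes", "No", "Yes", "Yes", "No"), n),
  stringsAsFactors = TRUE
)

# Class priors: relative frequency of each class.
prior <- prop.table(table(train$Class))

# P(V3 | Class): each row of the contingency table is normalised to sum to 1.
cond_V3 <- prop.table(table(Class = train$Class, V3 = train$V3), margin = 1)

print(prior)
print(cond_V3)
```

The same `prop.table(table(...), margin = 1)` call with V1, V2, V4 or V5 in place of V3 should give the other tables.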

Case 1: Laplace smoothing not used

I want to find out which class an observation with V3 = Yes belongs to. So my test data is:

V3
Yes

So I have to compute the probability of each class, i.e. Probability(X | V3=Yes), Probability(Y | V3=Yes) and Probability(Z | V3=Yes), and take the maximum of the three. Now,

Probability(X | V3=Yes) = Probability(X) * Probability(V3=Yes | X) / P(V3)

From the conditional probabilities above, we know that Probability(V3=Yes | X) = 0. So Probability(X | V3=Yes) should be 0, and Probability(Y | V3=Yes) and Probability(Z | V3=Yes) should be 0.5 each.
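
The hand calculation can be written out as a short base-R sketch. The variable names are illustrative; the numbers are the priors and the unsmoothed V3 = Yes column listed above, and the denominator P(V3) is taken to be the sum of the three numerators.

```r
# Priors and P(V3 = Yes | class) without smoothing, from the tables above.
prior   <- c(X = 16, Y = 7, Z = 7) / 30
p_v3yes <- c(X = 0, Y = 1 / 7, Z = 1 / 7)

# Numerators of Bayes' rule, one per class.
joint <- prior * p_v3yes

# P(V3 = Yes) is the sum of the numerators over all classes.
evidence <- sum(joint)

posterior <- joint / evidence
print(posterior)  # X = 0, Y = 0.5, Z = 0.5
```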

But the output in R is different. I used the naiveBayes function from the e1071 package. Below is the code and its output:

library(e1071)
model_nb <- naiveBayes(Class ~ ., data = train, laplace = 0)
results <- predict(model_nb, test, type = "raw")
print(results)

#              X         Y         Z
# [1,] 0.5714286 0.2142857 0.2142857

Can someone please explain why R produces this output?

Case 2: Laplace smoothing used

Same scenario as Case 1 with respect to the test data; the only difference is that laplace = 1 is used. So again I have to compute Probability(X | V3=Yes), Probability(Y | V3=Yes) and Probability(Z | V3=Yes) and take the maximum of the three.

Below are the conditional probabilities after Laplace smoothing (k=1):

V1
Y          No       Yes
   X 0.7222222 0.2777778
   Y 0.7777778 0.2222222
   Z 0.8888889 0.1111111

V2
Y          No       Yes
   X 0.5555556 0.4444444
   Y 0.7777778 0.2222222
   Z 0.7777778 0.2222222

V3
Y          No        Yes
   X 0.94444444 0.05555556
   Y 0.77777778 0.22222222
   Z 0.77777778 0.22222222

V4
Y          No       Yes
   X 0.8888889 0.1111111
   Y 0.8888889 0.1111111
   Z 0.8888889 0.1111111

V5
Y          No       Yes
   X 0.5555556 0.4444444
   Y 0.5555556 0.4444444
   Z 0.5555556 0.4444444

From the naive Bayes definition,

Probability(X| V3=Yes)= Probability(X) * Probability(V3=Yes|X)/ P(V3)

Probability(Y| V3=Yes)= Probability(Y) * Probability(V3=Yes|Y)/ P(V3)

Probability(Z| V3=Yes)= Probability(Z) * Probability(V3=Yes|Z)/ P(V3)

After calculation I have,

Probability(X| V3=Yes)= 0.53 * 0.05555556 / P(V3)=0.029/P(V3)

Probability(Y| V3=Yes)= 0.23 * 0.22222222 / P(V3)=0.051/P(V3)

Probability(Z| V3=Yes)= 0.23 * 0.22222222 / P(V3)=0.051/P(V3)

From the above calculation, classes Y and Z should tie for the highest probability. But the output in R is different: class X is shown as the predicted class. Below is the code and its output:
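
As a rough check of this arithmetic, here is a base-R sketch that applies add-one smoothing to the V3 = Yes counts and normalises the result. It assumes the smoothed estimate is (count + k) / (class size + k * number of levels), which is consistent with the smoothed tables above (for example 1/18 = 0.05555556 for class X); the variable names are illustrative.

```r
prior     <- c(X = 16, Y = 7, Z = 7) / 30
n_class   <- c(X = 16, Y = 7, Z = 7)   # rows per class
count_yes <- c(X = 0,  Y = 1, Z = 1)   # rows with V3 = Yes per class
k         <- 1                         # Laplace constant
n_levels  <- 2                         # V3 takes the values "No" and "Yes"

# Smoothed P(V3 = Yes | class): 1/18, 2/9, 2/9.
p_v3yes <- (count_yes + k) / (n_class + k * n_levels)

# P(V3 = Yes) is the sum of prior * likelihood over the classes.
joint     <- prior * p_v3yes
evidence  <- sum(joint)
posterior <- joint / evidence

print(round(posterior, 7))  # X = 0.2222222, Y = 0.3888889, Z = 0.3888889
```

Under these assumptions P(V3) is about 0.1333, and Y and Z tie at roughly 0.389 each.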

library(e1071)
model_nb <- naiveBayes(Class ~ ., data = train, laplace = 1)
results <- predict(model_nb, test, type = "raw")
print(results)

#              X         Y         Z
# [1,] 0.5811966 0.2094017 0.2094017

Again, can someone please explain why R produces this output? Am I going wrong anywhere in my calculation?

Also, I need some explanation of how P(V3) is calculated when Laplace smoothing is applied.

Thanks in advance!


Solution

  • The problem shows up because your test dataset has just one sample, with only one value of V3. If you give it a bit more test data, you get the expected results (focusing only on your Case 1):

    test <- data.frame(V3=c("Yes", "No"))
    predict(model_nb, test, type="raw")
                   X         Y         Z
    [1,] 0.007936508 0.4960317 0.4960317
    [2,] 0.571428571 0.2142857 0.2142857
    

    Note that you don't get exactly 0, 0.5, 0.5 for V3="Yes", because the function replaces zero probabilities with a threshold, which you can adjust; see ?predict.naiveBayes for more info.
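
The effect of the threshold can be sketched in base R. This assumes the default `threshold = 0.001` described in ?predict.naiveBayes and simply substitutes it for the zero entry by hand; it mimics the idea rather than calling the package.

```r
prior   <- c(X = 16, Y = 7, Z = 7) / 30
p_v3yes <- c(X = 0, Y = 1 / 7, Z = 1 / 7)   # unsmoothed P(V3 = Yes | class)

threshold <- 0.001                          # assumed default of predict.naiveBayes
p_adj <- ifelse(p_v3yes == 0, threshold, p_v3yes)

joint     <- prior * p_adj
posterior <- joint / sum(joint)
print(round(posterior, 9))  # X = 0.007936508, Y = 0.496031746, Z = 0.496031746
```

This matches the first row of the output above, which suggests the small non-zero value for X comes from the threshold alone.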

    The problem is actually due to the internal implementation of predict.naiveBayes (the source code is available on CRAN). I'm not going to go into all the details, but I debugged the function, and at one step there is this line,

    newdata <- data.matrix(newdata)
    

    which will later decide which column of the conditional probabilities to use. With your original data the data.matrix looks like this:

    data.matrix(data.frame(V3="Yes"))
         V3
    [1,]  1
    

    Thus it later assumes that the conditional probabilities are to be taken from column 1, i.e. the values 1.0000000, 0.8571429 and 0.8571429 for V3="No", and that's why you were getting results as if V3 were actually "No".
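
To see that reading column 1 explains the number in the question, here is a small base-R sketch. `posterior_from_code` is a hypothetical helper written for this illustration, not a function from e1071; it picks the likelihood column by integer code, as described above.

```r
# Priors and the unsmoothed V3 table, as reported in the question.
prior <- c(X = 16, Y = 7, Z = 7) / 30
cond_V3 <- rbind(X = c(No = 1,     Yes = 0),
                 Y = c(No = 6 / 7, Yes = 1 / 7),
                 Z = c(No = 6 / 7, Yes = 1 / 7))

# Hypothetical helper: posterior when the likelihood is read from column `code`.
posterior_from_code <- function(code) {
  joint <- prior * cond_V3[, code]
  joint / sum(joint)
}

wrong <- posterior_from_code(1)  # code 1 selects the "No" column
print(round(wrong, 7))           # X = 0.5714286, Y = 0.2142857, Z = 0.2142857
```

These are the values reported in Case 1 of the question.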

    However,

    data.matrix(data.frame(V3=c("Yes", "No")))
         V3
    [1,]  2
    [2,]  1
    

    gives column 2 of the conditional probabilities when V3 is "Yes", and thus you get the right result.

    I'm pretty sure your case 2 is just analogous.
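
A quick base-R check supports this. Assuming the smoothed estimate is (count + 1) / (class size + 2), reading the smoothed "No" column instead of the "Yes" column reproduces the Case 2 output from the question; the variable names are illustrative.

```r
prior    <- c(X = 16, Y = 7, Z = 7) / 30
n_class  <- c(X = 16, Y = 7, Z = 7)   # rows per class
count_no <- c(X = 16, Y = 6, Z = 6)   # rows with V3 = No per class

# Smoothed P(V3 = No | class) with laplace = 1 and two levels: 17/18, 7/9, 7/9.
smoothed_no <- (count_no + 1) / (n_class + 2)

joint     <- prior * smoothed_no
posterior <- joint / sum(joint)
print(round(posterior, 7))  # X = 0.5811966, Y = 0.2094017, Z = 0.2094017
```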

    Hope it helps.

    EDIT after comments: I guess the easiest way to solve it is to put all the data in one data.frame and select the indices you use for training and testing your model. Many functions accept a subset argument to select the training data, and naiveBayes is no exception. However, for predict.naiveBayes you have to select the rows yourself. Something like this:

    all_data <- rbind(train, c(NA, NA, NA, "Yes", NA, NA))
    trainIndex <- 1:30
    model_nb <- naiveBayes(Class~., data=all_data, laplace=0, subset=trainIndex)
    predict(model_nb, all_data[-trainIndex,], type="raw")
    

    gives the expected result.

                   X         Y         Z
    [1,] 0.007936508 0.4960317 0.4960317
    

    Note that this works because in this case when you do the data.matrix operation you get the right result.

    data.matrix(all_data[-trainIndex,])
       Class V1 V2 V3 V4 V5
    31    NA NA NA  2 NA NA
    

    EDIT2 after comments: Some more details on why this is happening.

    When you define your test data frame with only one value, equal to "Yes", the conversion performed by data.matrix has no way to know that your variable V3 has 2 possible values, "Yes" and "No". test$V3 is actually a factor (in R versions before 4.0.0, where data.frame converts strings to factors by default):

    test <- data.frame(V3="Yes")
    class(test$V3)
    [1] "factor"
    

    and, as said, it has only one level (there is no way for the data.frame to know there are actually 2):

    levels(test$V3)
    [1] "Yes"
    

    The implementation of data.matrix, as you can see in the docs, uses the levels of the factor:

    Factors and ordered factors are replaced by their internal codes.

    Thus, when converting test with data.matrix, it assumes there is only one possible value of the factor and encodes it as 1:

    data.matrix(test)
         V3
    [1,]  1
    

    However, when you do the trick of putting training and test into the same dataframe, the factor levels are properly defined.

    levels(all_data$V3)
    [1] "No"  "Yes"
    

    The result would be the same if you did this:

    test <- data.frame(V3=factor("Yes", levels=levels(all_data$V3)))
    test
       V3
    1 Yes
    levels(test$V3)
    [1] "No"  "Yes"
    data.matrix(test)
         V3
    [1,]  2
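
The same idea can be generalised to every column. The sketch below uses a hypothetical helper, `align_levels`, that copies the factor levels of the training data onto the test data before prediction; `train_stub` is a tiny stand-in for the real training frame, used only to keep the example self-contained.

```r
# Hypothetical helper: give each shared column of `newdata` the factor
# levels used in `train`, so integer codes agree between the two frames.
align_levels <- function(newdata, train) {
  for (col in intersect(names(newdata), names(train))) {
    if (is.factor(train[[col]])) {
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]))
    }
  }
  newdata
}

# Stand-in for the training data: only the levels of V3 matter here.
train_stub <- data.frame(V3 = factor(c("No", "Yes")))

test    <- data.frame(V3 = "Yes")
aligned <- align_levels(test, train_stub)

levels(aligned$V3)    # "No" "Yes"
data.matrix(aligned)  # V3 is encoded as 2, i.e. the "Yes" column
```

With the real data, `align_levels(test, train)` would be called before `predict(model_nb, ...)`; whether this is still needed may depend on the installed e1071 version.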