Naive Bayes Classification with R - strange result


I have the following problem: I'd like to predict a factor variable "cancer" (yes or no) from two variables, "sex" and "agegroup", using a naive Bayes classifier. These are my (fictional) sample data:

install.packages("e1071")
install.packages("gmodels")
library(e1071)
library(gmodels)

data<-read.csv("http://www.reduts.net/cancer.csv", sep=";", stringsAsFactors = T)

## Sex and Agegroup ##
######################

# classification 
testset<-data[,c("sex", "agegroup")]
cancer<-data[,"cancer"]
model<-naiveBayes(testset, cancer)
model

# apply model on testset
testset$predicted<-predict(model, testset)
testset$cancer<-cancer

CrossTable(testset$predicted, testset$cancer, prop.chisq=F, prop.r=F, prop.c=F, prop.t=F)

The result shows me that according to my data males and younger people are more likely to have cancer. Compared to the real cancer-classification my model classifies 147 (=88+59) out of 200 cases correctly (73.5%).

                  | testset$original 
testset$predicted |        no |       yes | Row Total | 
------------------|-----------|-----------|-----------|
               no |        88 |        12 |       100 | 
------------------|-----------|-----------|-----------|
              yes |        54 |        46 |       100 | 
------------------|-----------|-----------|-----------|
     Column Total |       142 |        58 |       200 | 
------------------|-----------|-----------|-----------|

However, I then did the same thing using only one predictor variable (sex):

## Sex only         ##
######################

# classification 
testset2<-data[,c("sex")]
cancer<-data[,"cancer"]
model2<-naiveBayes(testset2, cancer)
model2

The model is as follows:

Naive Bayes Classifier for Discrete Predictors

Call:
naiveBayes.default(x = testset2, y = cancer)

A-priori probabilities:
cancer
   no   yes 
0.645 0.355 

Conditional probabilities:
      x
cancer         f         m
   no  0.4573643 0.5426357
   yes 0.5774648 0.4225352

Obviously, males are more likely to have cancer compared to females (54% vs 46%).

# apply model on testset
testset2$predicted<-predict(model2, testset2)
testset2$cancer<-cancer

CrossTable(testset2$predicted, testset2$cancer, prop.chisq=F, prop.r=F, prop.c=F, prop.t = F)

Now, when I apply my model to the original data, all cases are classified as the same class:

Total Observations in Table:  200 

                   | testset2$cancer 
testset2$predicted |        no |       yes | Row Total | 
-------------------|-----------|-----------|-----------|
                no |       129 |        71 |       200 | 
-------------------|-----------|-----------|-----------|
      Column Total |       129 |        71 |       200 | 
-------------------|-----------|-----------|-----------|

Can anyone please explain to me why both females and males are assigned to the same class?


Solution

  • You are misinterpreting those outputs. When you print out model2 and see

    Conditional probabilities:
          x
    cancer         f         m
       no  0.4573643 0.5426357
       yes 0.5774648 0.4225352
    

    It is wrong to conclude "Obviously, males are more likely to have cancer compared to females (54% vs 46%)."

    What this table actually reports is the following four conditional probabilities:

    P(female | no cancer)     P(male | no cancer) 
    P(female | cancer)        P(male | cancer)
    

    It is easy to see this by looking at the output of

    table(cancer, testset2)
          testset2
    cancer  f  m
       no  59 70
       yes 41 30
    

    The first line of conditional probabilities from the model can be computed as follows: 129 people do not have cancer. 59/129 = 0.4573643 are female. 70/129 = 0.5426357 are male. So the way to read that first line is "Given that a patient does not have cancer, they are more likely to be male (54% vs 46%)".
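
    As a rough check, the conditional probabilities and priors can be rebuilt from the counts alone in base R. This is only a sketch: the counts are typed in by hand from the table above rather than read from the CSV, and the names counts, cond and prior are made up for illustration.

    ```r
    # Counts copied by hand from the table(cancer, testset2) output above
    counts <- matrix(c(59, 70,
                       41, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(cancer = c("no", "yes"), sex = c("f", "m")))

    # Row-wise proportions: each row is divided by its row total,
    # which gives P(sex | cancer status)
    cond <- prop.table(counts, margin = 1)
    print(round(cond, 7))

    # Class priors: P(cancer status), ignoring sex
    prior <- rowSums(counts) / sum(counts)
    print(prior)
    ```

    Each row of cond sums to 1, which is a quick way to see that the table conditions on the cancer status and not on sex.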

    Now to your question of why both females and males are assigned to the same class.

    To decide which class males will be assigned to, you need to compare
    P(Cancer | Male) with P(No Cancer | Male). Whichever is bigger determines the predicted class. Naïve Bayes estimates these by applying Bayes' rule, which reformulates the comparison as

    P(Cancer | Male) = P(Male | Cancer) * P(Cancer) / P(Male)  
    with  
    P(No Cancer | Male) = P(Male | No Cancer) * P(No Cancer) / P(Male)
    

    The denominators are the same in both cases, so if we only care about which is bigger, we can compare the size of

    P(Male | Cancer) * P(Cancer) with P(Male | No Cancer) * P(No Cancer)

    The factors in these products are exactly the figures reported when you print out the model: the conditional probabilities and the a-priori probabilities.

    So, for the males

    P(Male | Cancer) * P(Cancer)        = 0.4225352 * 0.355 = 0.15
    P(Male | No Cancer) * P(No Cancer)  = 0.5426357 * 0.645 = 0.35
    

    (Note: these are not the actual posterior probabilities, because we ignored the denominator P(Male).) Since No Cancer has the higher value, we predict No Cancer for males.
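
    To see what the ignored denominator does, here is a small base R sketch for the males. It assumes the hand-typed counts from the table(cancer, testset2) output, and the variable names are illustrative only.

    ```r
    # Counts copied by hand from the table(cancer, testset2) output
    counts <- matrix(c(59, 70,
                       41, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(cancer = c("no", "yes"), sex = c("f", "m")))

    prior     <- rowSums(counts) / sum(counts)    # P(No Cancer), P(Cancer)
    cond_male <- counts[, "m"] / rowSums(counts)  # P(Male | No Cancer), P(Male | Cancer)

    # Unnormalized scores: P(Male | class) * P(class)
    score_male <- cond_male * prior

    # The scores sum to P(Male), the denominator that was ignored
    p_male <- sum(score_male)

    # Dividing by P(Male) turns the scores into posterior probabilities
    posterior_male <- score_male / p_male

    print(score_male)      # no = 0.35, yes = 0.15
    print(posterior_male)  # no = 0.70, yes = 0.30
    ```

    Dividing both scores by the same P(Male) = 0.5 rescales them but does not change which one is larger, so the predicted class is the same either way.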

    Similarly, for females we compute

    P(Female | Cancer) * P(Cancer)      = 0.5774648 * 0.355 = 0.205
    P(Female | No Cancer) * P(No Cancer)    = 0.4573643 * 0.645 = 0.295
    

    and for females too we predict No Cancer. This calculation for females is worth emphasizing: even though P(Female | Cancer) > P(Female | No Cancer), these are weighted by the overall probabilities P(Cancer) and P(No Cancer).
    Since No Cancer is more likely overall than Cancer, the weighting switches which product is bigger. Naïve Bayes therefore predicts No Cancer for both sexes.
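
    The effect of the prior can be made visible with a small base R sketch. It again assumes the hand-typed counts from the table(cancer, testset2) output. The helper classify() is hypothetical and is not part of e1071, and the equal prior is an illustrative what-if, not something estimated from the data.

    ```r
    # Counts copied by hand from the table(cancer, testset2) output
    counts <- matrix(c(59, 70,
                       41, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(cancer = c("no", "yes"), sex = c("f", "m")))

    cond  <- prop.table(counts, margin = 1)   # P(sex | class), one row per class
    prior <- rowSums(counts) / sum(counts)    # empirical P(class): no = 0.645, yes = 0.355

    # Hypothetical helper: for each sex, return the class with the larger
    # score P(sex | class) * P(class)
    classify <- function(prior) {
      # The length-2 prior is recycled down the rows, so row i is scaled by prior[i]
      scores <- cond * prior
      apply(scores, 2, function(col) names(which.max(col)))
    }

    pred_empirical <- classify(prior)          # priors estimated from the data
    pred_equal     <- classify(c(0.5, 0.5))    # what-if: both classes equally likely

    print(pred_empirical)  # f = "no",  m = "no"
    print(pred_equal)      # f = "yes", m = "no"
    ```

    With the empirical prior both sexes end up in the No Cancer class, as in the CrossTable output. With a hypothetical equal prior the conditional probabilities alone decide, and females would be classified as Cancer.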