I have the following problem: I'd like to predict a factor variable "cancer" (yes or no) from two variables, "sex" and "agegroup", using a naive Bayes classifier. These are my (fictional) sample data:
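library(e1071)    # provides naiveBayes()
library(gmodels)  # provides CrossTable()
data<-read.csv("http://www.reduts.net/cancer.csv", sep=";", stringsAsFactors = T)
cancer<-data$cancer  # the calls below refer to cancer by its bare name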
## Sex and Agegroup ##
# classification
testset<-data[,c("sex", "agegroup")]
model<-naiveBayes(testset, cancer)
# apply model on testset
testset$predicted<-predict(model, testset)
CrossTable(testset$predicted, testset$cancer, prop.chisq=F, prop.r=F, prop.c=F, prop.t = F)
The result shows me that according to my data males and younger people are more likely to have cancer. Compared to the real cancer-classification my model classifies 147 (=88+59) out of 200 cases correctly (73.5%).
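                  | testset$original
testset$predicted |        no |       yes | Row Total |
------------------|-----------|-----------|-----------|
               no |        88 |        12 |       100 |
              yes |        54 |        46 |       100 |
------------------|-----------|-----------|-----------|
     Column Total |       142 |        58 |       200 |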
However, I then did the same thing using only one predictor variable (sex):
## Sex only ##
# classification
model2<-naiveBayes(testset2, cancer)
The model is as follows:
Naive Bayes Classifier for Discrete Predictors
naiveBayes.default(x = testset2, y = cancer)
A-priori probabilities:
   no   yes
0.645 0.355

Conditional probabilities:
cancer         f         m
   no  0.4573643 0.5426357
   yes 0.5774648 0.4225352
Obviously, males are more likely to have cancer compared to females (54% vs 46%).
# apply model on testset
testset2$predicted<-predict(model2, testset2)
CrossTable(testset2$predicted, testset2$cancer, prop.chisq=F, prop.r=F, prop.c=F, prop.t = F)
Now, when I apply my model to the original data, all cases are classified as the same class:
Total Observations in Table: 200
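                   | testset2$cancer
testset2$predicted |        no |       yes | Row Total |
-------------------|-----------|-----------|-----------|
                no |       129 |        71 |       200 |
-------------------|-----------|-----------|-----------|
      Column Total |       129 |        71 |       200 |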
Can anyone please explain to me why both females and males are assigned to the same class?
You are misinterpreting those outputs. When you print out model2 and see
Conditional probabilities:
      x
cancer         f         m
   no  0.4573643 0.5426357
   yes 0.5774648 0.4225352
It is wrong to conclude "Obviously, males are more likely to have cancer compared to females (54% vs 46%)."
What this table actually gives us is four conditional probabilities, each conditioned on the cancer status (one row per status):
P(female | no cancer)   P(male | no cancer)
P(female | cancer)      P(male | cancer)
It is easy to see this by looking at the output of
table(cancer, testset2)
      testset2
cancer  f  m
   no  59 70
   yes 41 30
The first line of conditional probabilities from the model can be computed as follows: 129 people do not have cancer. 59/129 = 0.4573643 are female. 70/129 = 0.5426357 are male. So the way to read that first line is "Given that a patient does not have cancer, they are more likely to be male (54% vs 46%)".
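As a rough check, the conditional probabilities and the a-priori probabilities can be rebuilt from the four counts alone. The sketch below uses base R only; the counts are typed in from the table above, and the names `counts`, `cond` and `prior` are illustrative, not part of the original code.

```r
# Counts copied from the table(cancer, testset2) output above
counts <- matrix(c(59, 70,
                   41, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(cancer = c("no", "yes"), sex = c("f", "m")))

# Row-wise proportions: each row sums to 1, giving P(sex | cancer status)
cond <- prop.table(counts, margin = 1)
print(round(cond, 7))

# Row totals divided by the grand total give the a-priori probabilities
prior <- rowSums(counts) / sum(counts)
print(prior)
```

Each row of `cond` sums to 1, which is another way to see that the rows are distributions over sex given the cancer status, not distributions over cancer status given sex.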
Now to your question: "Can anyone please explain to me why both females and males are assigned to the same class?"
To decide which class males will be assigned to, you need to compare
P(Cancer | Male)
with
P(No Cancer | Male).
Whichever is bigger determines the predicted class. When using Naïve Bayes, these are estimated by applying Bayes' rule, which reformulates the comparison as
P(Cancer | Male) = P(Male | Cancer) * P(Cancer) / P(Male)
with
P(No Cancer | Male) = P(Male | No Cancer) * P(No Cancer) / P(Male)
The denominators are the same in both cases, so if we only care about which is bigger, we can compare
P(Male | Cancer) * P(Cancer)
with
P(Male | No Cancer) * P(No Cancer)
The factors in these products are exactly the figures reported when you print out the model: the a-priori probabilities and the conditional probabilities.
So, for the males
P(Male | Cancer) * P(Cancer) = 0.4225352 * 0.355 = 0.15
P(Male | No Cancer) * P(No Cancer) = 0.5426357 * 0.645 = 0.35
(Note: these are not real probabilities because we ignored the denominator.) Since No Cancer has the higher number, we predict No Cancer for males.
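To turn the two male scores into real probabilities, one option is to divide by their sum, which equals P(Male) by the law of total probability. This is a minimal base-R sketch using the rounded figures printed by the model; all variable names are illustrative.

```r
# Figures as printed by the model
prior_cancer        <- 0.355
prior_no_cancer     <- 0.645
p_male_given_cancer <- 0.4225352
p_male_given_no     <- 0.5426357

# Unnormalised scores (the numerators of Bayes' rule)
score_cancer <- p_male_given_cancer * prior_cancer
score_no     <- p_male_given_no * prior_no_cancer

# P(Male) is the sum of both numerators
p_male <- score_cancer + score_no

# Normalised posteriors; these sum to 1
post_cancer <- score_cancer / p_male
post_no     <- score_no / p_male
print(round(c(cancer = post_cancer, no_cancer = post_no), 3))
```

Dividing both scores by the same positive number cannot change which one is larger, which is why the classifier can skip this step when it only needs the predicted class.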
Similarly, for females we compute
P(Female | Cancer) * P(Cancer) = 0.5774648 * 0.355 = 0.205
P(Female | No Cancer) * P(No Cancer) = 0.4573643 * 0.645 = 0.295
and for females, too, we predict No Cancer. It may be useful to emphasize this calculation for females. Even though P(Female | Cancer) > P(Female | No Cancer), these are weighted by the overall probabilities P(Cancer) and P(No Cancer). Since overall it is more likely to have No Cancer than Cancer, that switches which is bigger. Naïve Bayes therefore predicts No Cancer for both genders.
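The whole decision for both sexes can be sketched in a few lines of base R. The numbers are copied from the printed model; the names `scores`, `predicted` and `flat_prior` are illustrative, and the 50/50 prior at the end is a hypothetical added only to show how much the prior matters.

```r
# A-priori and conditional probabilities as printed by the model
prior <- c(no = 0.645, yes = 0.355)
cond  <- matrix(c(0.4573643, 0.5426357,
                  0.5774648, 0.4225352),
                nrow = 2, byrow = TRUE,
                dimnames = list(cancer = c("no", "yes"), sex = c("f", "m")))

# Row i is multiplied by prior[i]: P(sex | class) * P(class)
scores <- cond * prior
print(round(scores, 3))

# For each sex (column), pick the class with the larger score
predicted <- setNames(rownames(scores)[apply(scores, 2, which.max)],
                      colnames(scores))
print(predicted)

# Column-wise normalisation gives the actual posteriors P(class | sex)
posterior <- prop.table(scores, margin = 2)
print(round(posterior, 3))

# Hypothetical: with equal priors the weighting disappears
flat_prior     <- c(no = 0.5, yes = 0.5)
flat_scores    <- cond * flat_prior
predicted_flat <- setNames(rownames(flat_scores)[apply(flat_scores, 2, which.max)],
                           colnames(flat_scores))
print(predicted_flat)
```

With the priors from the data, both columns select "no". Under the hypothetical equal priors, the female column selects "yes" while the male column still selects "no", which is the switch described above.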