I have the following problem: I'd like to predict a factor variable "cancer" (yes or no) from two variables "sex" and "agegroup" with a naive Bayes classifier. These are my (fictional) sample data:
install.packages("e1071")
install.packages("gmodels")
library(e1071)
library(gmodels)
data<-read.csv("http://www.reduts.net/cancer.csv", sep=";", stringsAsFactors = T)
## Sex and Agegroup ##
######################
# classification
testset<-data[,c("sex", "agegroup")]
cancer<-data[,"cancer"]
model<-naiveBayes(testset, cancer)
model
# apply model on testset
testset$predicted<-predict(model, testset)
testset$cancer<-cancer
CrossTable(testset$predicted, testset$cancer, prop.chisq=F, prop.r=F, prop.c=F, prop.t = F)
The result shows that, according to my data, males and younger people are more likely to have cancer. Compared to the actual cancer classification, my model classifies 134 (= 88 + 46) of the 200 cases correctly (67%); see the snippet after the cross table.
                  | testset$cancer
testset$predicted | no | yes | Row Total |
------------------|-----------|-----------|-----------|
no | 88 | 12 | 100 |
------------------|-----------|-----------|-----------|
yes | 54 | 46 | 100 |
------------------|-----------|-----------|-----------|
Column Total | 142 | 58 | 200 |
------------------|-----------|-----------|-----------|
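For reference, a quick way to compute the overall accuracy (a sketch, assuming testset still contains the predicted and cancer columns added above):

# Proportion of cases where the predicted class matches the actual class
# (134 / 200 = 0.67 according to the cross table above).
mean(testset$predicted == testset$cancer)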
However, I then did the same thing using only one classification variable (sex):
## Sex only ##
######################
# classification
testset2<-data[,c("sex")]
cancer<-data[,"cancer"]
model2<-naiveBayes(testset2, cancer)
model2
The model is as follows:
Naive Bayes Classifier for Discrete Predictors
Call:
naiveBayes.default(x = testset2, y = cancer)
A-priori probabilities:
cancer
no yes
0.645 0.355
Conditional probabilities:
x
cancer f m
no 0.4573643 0.5426357
yes 0.5774648 0.4225352
Obviously, males are more likely to have cancer compared to females (54% vs 46%).
# apply model on testset
testset2$predicted<-predict(model2, testset2)
testset2$cancer<-cancer
CrossTable(testset2$predicted, testset2$cancer, prop.chisq=F, prop.r=F, prop.c=F, prop.t = F)
Now, when I apply my model to the original data, all cases are assigned to the same class:
Total Observations in Table: 200
| testset2$cancer
testset2$predicted | no | yes | Row Total |
-------------------|-----------|-----------|-----------|
no | 129 | 71 | 200 |
-------------------|-----------|-----------|-----------|
Column Total | 129 | 71 | 200 |
-------------------|-----------|-----------|-----------|
Can anyone please explain why both females and males are assigned to the same class?
You are misinterpreting those outputs. When you print out model2 and see
Conditional probabilities:
   x
cancer          f         m
   no   0.4573643 0.5426357
   yes  0.5774648 0.4225352
It is wrong to conclude "Obviously, males are more likely to have cancer compared to females (54% vs 46%)."
What this table is telling us is the four numbers
P(female | no cancer)    P(male | no cancer)
P(female | cancer)       P(male | cancer)
It is easy to see this by looking at the output of
table(cancer, testset2)
      testset2
cancer  f  m
   no  59 70
   yes 41 30
The first line of conditional probabilities from the model can be computed as follows: 129 people do not have cancer; 59/129 = 0.4573643 of them are female and 70/129 = 0.5426357 are male. So the way to read that first line is: "Given that a patient does not have cancer, they are more likely to be male (54% vs 46%)".
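If you want to reproduce those conditional probabilities directly in R, a row-wise proportion table does it (a sketch, assuming testset2 still holds the bare sex factor, i.e. before the predicted and cancer columns were added to it):

# Row-wise proportions of the cancer-by-sex counts reproduce the
# "Conditional probabilities" block printed for model2.
prop.table(table(cancer, testset2), margin = 1)
#        testset2
# cancer          f         m
#    no  0.4573643 0.5426357
#    yes 0.5774648 0.4225352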
Now to your question: can anyone please explain why both females and males are assigned to the same class?
To decide which class males will be assigned to, you need to compare P(Cancer | Male) with P(No Cancer | Male); whichever is bigger determines the predicted class. With naive Bayes, these are estimated by applying Bayes' rule, which reformulates the problem as comparing
P(Cancer | Male)    = P(Male | Cancer)    * P(Cancer)    / P(Male)

with

P(No Cancer | Male) = P(Male | No Cancer) * P(No Cancer) / P(Male)
The denominators are the same in both cases, so if we only care about which is bigger, we can compare the size of

P(Male | Cancer) * P(Cancer)

with

P(Male | No Cancer) * P(No Cancer)
These are exactly the figures being reported when you print out the model.
So, for males,

P(Male | Cancer)    * P(Cancer)    = 0.4225352 * 0.355 = 0.15
P(Male | No Cancer) * P(No Cancer) = 0.5426357 * 0.645 = 0.35
(Note: these are not real probabilities, because we ignored the denominator P(Male).) Since No Cancer has the higher number, we predict No Cancer for males.
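Here is a minimal R sketch of that comparison, plugging in the numbers from the printed model2 output:

# Unnormalized naive Bayes scores for a male patient, built from the
# a-priori and conditional probabilities shown above.
p_cancer    <- 0.355
p_no_cancer <- 0.645
p_male_given_cancer    <- 0.4225352
p_male_given_no_cancer <- 0.5426357
p_male_given_cancer * p_cancer        # ~0.15 -> score for "yes"
p_male_given_no_cancer * p_no_cancer  # ~0.35 -> score for "no", the larger one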
Similarly, for females we compute

P(Female | Cancer)    * P(Cancer)    = 0.5774648 * 0.355 = 0.205
P(Female | No Cancer) * P(No Cancer) = 0.4573643 * 0.645 = 0.295
and for females too we predict No Cancer. It is worth emphasizing this calculation for females: even though P(Female | Cancer) > P(Female | No Cancer), these conditional probabilities are weighted by the overall probabilities P(Cancer) and P(No Cancer).
Since overall it is more likely to have No Cancer than Cancer, that weighting flips which quantity is bigger, and naive Bayes predicts No Cancer for both genders.
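You can also ask predict() for the normalized posterior probabilities to confirm this (a sketch; because the single predictor was passed to naiveBayes as a bare vector, the model stores it under the name x, as the printed output shows, so the new data frame must use that column name):

# type = "raw" returns P(class | predictor) instead of the predicted class.
newpatients <- data.frame(x = factor(c("f", "m"), levels = c("f", "m")))
predict(model2, newpatients, type = "raw")
# In both rows the "no" column is larger, so both genders are predicted "no".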