I am trying to do Naive Bayes classification in R (package e1071). I tried the usual golf example, and I always get the opposite of the result I expect.
Scenario: if the weather is good, do I play golf, 'Yes' or 'No'? A very straightforward case.
I created a training dataset (df). Based on that dataset, I expect the result 'Yes' for 'Good' weather, but it gives me 'No':
[1] No
Levels: No Yes
Why is this happening? Is my understanding wrong, or am I doing something wrong?
Any help is much appreciated.
Cheers!
weather <- c("Good", "Good", "Good", "Bad", "Bad","Good")
golf <- c("Yes","No","Yes","No","Yes","Yes")
df <- data.frame(weather, golf) #Training dataset
df[] <- lapply(df, factor) #changing df to factor variables
df_new <- data.frame(weather = "Good") #Test dataset
library(e1071)
model <- naiveBayes(golf ~.,data=df)
predict(model, df_new, type ="class")
This is because factor encoding can be misleading. If you do not make sure that the factors in df and df_new are encoded the same way, the model sees different integer codes from the labels you see, and the results look absurd.
Take a look at the integer encoding of df:
print(df$weather)
Good Good Good Bad Bad Good
print(as.integer(df$weather))
2 2 2 1 1 2
And compare it to the encoding of df_new:
print(df_new$weather)
Good
print(as.integer(df_new$weather))
1
Good has been mapped to 1 in df_new, while 1 corresponds to Bad in df. So when you apply your model, you are actually asking for a prediction based on Bad weather.
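To make the mismatch concrete, here is a small base R sketch. The variable names (train_weather, test_weather, misread_label) are only illustrative and not part of the original code, and the factors are built explicitly with factor() so the result does not depend on how data.frame() treats strings in your R version.

```r
# Training factor: levels are sorted alphabetically, so Bad = 1 and Good = 2
train_weather <- factor(c("Good", "Good", "Good", "Bad", "Bad", "Good"))

# Test factor built on its own: it has a single level, so Good = 1
test_weather <- factor("Good")

levels(train_weather)  # "Bad" "Good"
levels(test_weather)   # "Good"

# Integer code of "Good" in each factor
train_code <- as.integer(train_weather)[1]  # 2
test_code  <- as.integer(test_weather)      # 1

# Reading the test code against the training levels picks the wrong label
misread_label <- levels(train_weather)[test_code]
print(misread_label)   # "Bad"
```

The label is the same in both factors, but the integer code behind it is not, and that code is what ends up being matched against the training levels.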
You need to give the factor in df_new the same levels as the corresponding factor in df, so that both are encoded the same way:
df_new <- data.frame(weather = "Good") #Test dataset
df_new$weather <- factor(df_new$weather, levels(df$weather))
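As a sanity check, the sketch below rebuilds the data, applies the same fix, and recomputes the naive Bayes scores for Good weather by hand in base R. The helper names (same_levels, new_code, prior, cond, score, by_hand) are made up for this illustration, and the hand calculation is only a cross-check of the expected answer, not a description of how e1071 works internally.

```r
weather <- c("Good", "Good", "Good", "Bad", "Bad", "Good")
golf    <- c("Yes", "No", "Yes", "No", "Yes", "Yes")
df <- data.frame(weather = factor(weather), golf = factor(golf))

# Test data: reuse the training levels so "Good" gets the same integer code
df_new <- data.frame(weather = factor("Good", levels = levels(df$weather)))

same_levels <- identical(levels(df_new$weather), levels(df$weather))
new_code    <- as.integer(df_new$weather)  # 2, the same code as in df

# Hand-computed scores for weather = "Good": P(class) * P(Good | class)
prior <- prop.table(table(df$golf))                          # No = 1/3, Yes = 2/3
cond  <- prop.table(table(df$golf, df$weather), margin = 1)  # rows sum to 1
score <- as.numeric(prior) * cond[, "Good"]                  # No = 1/6, Yes = 1/2
by_hand <- names(which.max(score))
print(by_hand)  # "Yes"

# With e1071 installed, the original calls can then be rerun on the fixed df_new:
#   library(e1071)
#   model <- naiveBayes(golf ~ ., data = df)
#   predict(model, df_new, type = "class")
```

If same_levels is TRUE and the hand-computed class is Yes, the prediction from the model on the corrected df_new should agree with what the training data suggests.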