Naive Bayes classification in R gives the opposite result


I am trying to do Naive Bayes classification in R (package e1071). I tried the usual golf example, and I always get the opposite of the result I expect.

Scenario: if the weather is good, do I play golf, 'Yes' or 'No'? A very straightforward case.

I created a training dataset (df). Based on that dataset, I expect 'Yes' for 'Good' weather, but the prediction is 'No':

[1] No
Levels: No Yes

Why is this happening? Is my understanding wrong, or am I doing something wrong?

Any help is much appreciated.

Cheers!

weather <- c("Good", "Good", "Good", "Bad", "Bad","Good")
golf <- c("Yes","No","Yes","No","Yes","Yes")
df <- data.frame(weather, golf) #Training dataset

df[] <- lapply(df, factor) #changing df to factor variables

df_new <- data.frame(weather = "Good") #Test dataset

library(e1071)
model <- naiveBayes(golf ~ ., data = df)
predict(model, df_new, type = "class")
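
As a sanity check on the expectation above, the posteriors can be worked out by hand in base R. This is only a sketch: it assumes plain frequency estimates with no smoothing, and `classes` and `posterior()` are illustrative names, not part of e1071.

```r
# Training data from the question
weather <- c("Good", "Good", "Good", "Bad", "Bad", "Good")
golf    <- c("Yes", "No", "Yes", "No", "Yes", "Yes")

classes <- sort(unique(golf))  # "No" "Yes"

# Hypothetical helper: P(golf = k | weather = w) via Bayes' rule,
# using relative frequencies for the prior and the conditional.
posterior <- function(w) {
  unnorm <- sapply(classes, function(k) {
    prior <- mean(golf == k)                # P(golf = k)
    cond  <- mean(weather[golf == k] == w)  # P(weather = w | golf = k)
    prior * cond
  })
  unnorm / sum(unnorm)
}

post_good <- posterior("Good")
post_bad  <- posterior("Bad")

print(post_good)  # No 0.25, Yes 0.75
print(post_bad)   # No 0.5,  Yes 0.5
```

With these estimates, 'Good' weather clearly favours 'Yes' (0.75 against 0.25), so the expectation in the question is reasonable. For 'Bad' weather the two classes come out level.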

Solution

  • This is because factor encoding can be misleading. If you do not make sure that the factors in df and df_new are encoded the same way, you will get (seemingly) absurd results compared to what you expect.

    Take a look at the integer encoding of df

    print(df$weather)
    Good Good Good Bad  Bad  Good
    print(as.integer(df$weather))
    2 2 2 1 1 2
    

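    And compare it to the encoding of df_new (a factor with a single level here; before R 4.0, data.frame() converted strings to factors by default)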

    print(df_new$weather)
    Good
    print(as.integer(df_new$weather))
    1
    

    Good has been mapped to 1 in df_new, whereas 1 corresponds to Bad in df. So when you apply your model, you are actually asking for a prediction for Bad weather.

    You need to give the factor in df_new the same levels as the one in df

    df_new <- data.frame(weather = "Good") #Test dataset
    df_new$weather <- factor(df_new$weather, levels = levels(df$weather)) #reuse the training levels
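
With more than one predictor, the same fix can be applied to every column at once. The following is a minimal sketch in base R; `align_levels()` is a hypothetical helper written for this illustration, not a function from e1071, and `train` and `new` stand in for df and df_new.

```r
# Hypothetical helper: re-encode each column of `new` that is a factor in
# `train` so that it uses the training levels (and thus the same integer codes).
align_levels <- function(new, train) {
  for (col in intersect(names(new), names(train))) {
    if (is.factor(train[[col]])) {
      new[[col]] <- factor(as.character(new[[col]]),
                           levels = levels(train[[col]]))
    }
  }
  new
}

train <- data.frame(
  weather = factor(c("Good", "Good", "Good", "Bad", "Bad", "Good")),
  golf    = factor(c("Yes", "No", "Yes", "No", "Yes", "Yes"))
)
new <- data.frame(weather = "Good")

aligned <- align_levels(new, train)
print(levels(aligned$weather))      # "Bad" "Good"
print(as.integer(aligned$weather))  # 2, the same code Good has in train
```

Going through as.character() means the helper works whether the new column arrives as a character vector or as a factor with its own levels. Values that never appeared in the training data become NA, which is worth checking for before calling predict().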