factors in prediction dataframe for naive_bayes in R


I am trying to understand how to create a data frame of factors to predict an outcome using naive_bayes. All the examples I have seen take a single data frame and split it into two (training and test). That approach works for me:

library(naivebayes)

#Basic naive-bayes model with prediction/test dataframe a subset of the original 

age_class<-c('x3','x2','x2','x1','x3','x1')
student<-c('n','y','n','y','y','y')
inc<-c('m','h','m','m','m','l')
sav<-c('e','f','e','e','f','f')
buy<-c('N','Y','Y','Y','Y','Y')

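# stringsAsFactors=TRUE keeps the columns as factors in R >= 4.0,
# where data.frame() no longer converts strings to factors by default
df<-data.frame(age_class,student,inc,sav,buy,stringsAsFactors=TRUE)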

nbmod<-naive_bayes(buy~ age_class + student +inc + sav, data=df[2:6,])

predictdf<-df[1,1:4]

predict(nbmod,newdata=predictdf)

Do I have to specify all the levels every time I create a data frame to predict on? Is there a way to reuse the factor-level information from the original data frame (df)? For example, at the moment I spell the levels out by hand:

age_class<-factor('x3', levels=c('x1','x2','x3'))
student<-factor('n', levels=c('n','y'))
inc<-factor('m', levels=c('h','l','m'))
sav<-factor('e',levels=c('e','f'))

predictdf3<-data.frame(age_class,student,inc,sav)

predict(nbmod,newdata=predictdf3)

Solution

  • In this case you can reuse the original levels via levels(), provided the columns of df are factors (in R >= 4.0 that means building df with stringsAsFactors = TRUE):

    predictdf3 <- data.frame(
        age_class = factor("x3", levels = levels(df$age_class)),
        student = factor("n", levels = levels(df$student)),
        inc = factor("m", levels = levels(df$inc)),
        sav = factor("e", levels = levels(df$sav))
    )
    

    Note that the factor encoding must be the same in the training and test data. So you either have to merge your training and test datasets before encoding and split them afterwards, or copy the factor levels from the training dataset to the test dataset.
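
Copying levels column by column gets repetitive with many predictors. A minimal sketch of a helper that does it in a loop is shown here, using base R only. The name align_levels is hypothetical (it is not part of naivebayes), and the sketch assumes the training columns are factors and that column names match between the two data frames.

```r
# Training predictors from the question. stringsAsFactors = TRUE is needed
# in R >= 4.0 so the columns are factors rather than character vectors.
df <- data.frame(
  age_class = c("x3", "x2", "x2", "x1", "x3", "x1"),
  student   = c("n", "y", "n", "y", "y", "y"),
  inc       = c("m", "h", "m", "m", "m", "l"),
  sav       = c("e", "f", "e", "e", "f", "f"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: re-encode every column of `newdata` with the factor
# levels of the same-named column in `train`. Columns that are missing from
# `train`, or are not factors there, are left unchanged.
align_levels <- function(newdata, train) {
  for (col in names(newdata)) {
    if (is.factor(train[[col]])) {
      newdata[[col]] <- factor(as.character(newdata[[col]]),
                               levels = levels(train[[col]]))
    }
  }
  newdata
}

# New observations can be written as plain strings and aligned afterwards.
new_obs   <- data.frame(age_class = "x3", student = "n", inc = "m", sav = "e")
predictdf <- align_levels(new_obs, df)

str(predictdf)
```

The result can then be passed to predict(nbmod, newdata = predictdf) as in the question. Any value that does not occur among the training levels becomes NA under this approach, so it is worth checking the aligned data frame with anyNA() before predicting.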