
How to create Naive Bayes in R for numerical and categorical variables


I am trying to implement a Naive Bayes model in R based on known information:

Age group, e.g. "18-24" and "25-34", etc.
Gender, "male" and "female"
Region, "London" and "Wales", etc.
Income, "£10,000 - £15,000", etc.
Job, "Full Time" and "Part Time", etc.

I am getting errors when I try to implement it. My code is below:

library(readxl)
iphone <- read_excel("~/Documents/iPhone_1k.xlsx")
View(iphone)

summary(iphone)
iphone

library(caTools)
library(e1071)

set.seed(101) 
sample = sample.split(iphone$Gender, SplitRatio = .7)
train = subset(iphone, sample == TRUE)
test  = subset(iphone, sample == FALSE)

nB_model <- naiveBayes(Gender ~ Region + Retailer, data = train)
pred <- predict(nB_model, test, type="raw") 

In the above scenario, I have an Excel file called iPhone_1k (1,000 entries relating to people who have visited a website to buy an iPhone). Each row is a person visiting the website, and the above demographics are known for each of them.

I have been trying to make the model work and have resorted to following the tutorial linked below, which uses only two variables (I would like to use at least 4 and introduce more, if possible):

https://rpubs.com/dvorakt/144238

I want to be able to use these demographics to predict which retailer they will go to (also known for each instance in the iPhone_1k file). There are only 3 options. Can you please advise how to complete this?

P.S. Below is a screenshot of a simplified version of the data I have used to keep it simple in R. Once I get some code to work, I'll expand the number of variables and entries.



Solution

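  • You are setting up the problem incorrectly. Your formula predicts Gender, but you want to predict Retailer, so Retailer has to be the response on the left-hand side of the formula: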

    naiveBayes(Retailer ~  Gender + Region + AgeGroup, data = train)    
    

    or, using every other column in train as a predictor:

    naiveBayes(Retailer ~ ., data = train)  
    

    You might also need to convert the columns to factors if they are character vectors. You can do this for all columns, right after reading from Excel, with

    iphone[] <- lapply(iphone, factor)  
    

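    Note that if you add numeric variables in the future, you should not apply this step to them. naiveBayes models numeric predictors with a Gaussian distribution, and converting them to factors would turn every distinct value into its own category.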
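
Putting the points above together, here is a minimal end-to-end sketch. It makes several assumptions: the data frame is synthetic (randomly generated, so the predictions carry no meaning), the retailer names and the numeric `Spend` column are hypothetical stand-ins, and the split is stratified on `Retailer` rather than `Gender`, since `Retailer` is the outcome. Only character columns are converted to factors, so a numeric predictor can be mixed in.

```r
library(e1071)    # naiveBayes()
library(caTools)  # sample.split()

set.seed(101)
n <- 300

# Synthetic stand-in for the Excel data; all values are made up for illustration.
iphone <- data.frame(
  AgeGroup = sample(c("18-24", "25-34", "35-44"), n, replace = TRUE),
  Gender   = sample(c("male", "female"), n, replace = TRUE),
  Region   = sample(c("London", "Wales", "Scotland"), n, replace = TRUE),
  Spend    = round(runif(n, 400, 1200)),  # hypothetical numeric predictor
  Retailer = sample(c("RetailerA", "RetailerB", "RetailerC"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Convert only the character columns to factors; numeric columns stay numeric.
is_chr <- vapply(iphone, is.character, logical(1))
iphone[is_chr] <- lapply(iphone[is_chr], factor)

# Split on the outcome so each retailer is represented in both sets.
in_train <- sample.split(iphone$Retailer, SplitRatio = 0.7)
train <- iphone[in_train, ]
test  <- iphone[!in_train, ]

# Retailer is the response; every other column is a predictor.
nb_model <- naiveBayes(Retailer ~ ., data = train)

pred_class <- predict(nb_model, test)                # predicted retailer per row
pred_prob  <- predict(nb_model, test, type = "raw")  # one probability column per retailer

# Confusion matrix of predicted vs. actual retailer
print(table(predicted = pred_class, actual = test$Retailer))
```

With real data, you would replace the `data.frame(...)` call with the `read_excel()` call from the question and keep the rest unchanged, provided the column names match.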