Search code examples
rclassificationnaivebayes

e1071 Package: naiveBayes prediction is slow


I am trying to run the naiveBayes classifier from the R package e1071. I am running into an issue where the time it takes to predict takes longer than the time it takes to train, by a factor of ~300.

I was wondering if anyone else has observed this behavior and, if so, if you have any suggestions on how to improve it.

This issue appears only in some instances. Below, I have code that trains and predicts the NB classifier on the Iris dataset. Here the training and prediction times match up quite closely (prediction takes 10x longer instead of 300x longer). The only other trace of this issue that I could find online is here. In that instance, the answer was to make sure that categorical variables are formatted as factors. I have done this, but still don't see any improvement.

I have played around with the sample size N and the problem seems to be lessened as N decreases. Perhaps this is intended behavior of the algorithm? Decreasing N by a factor of 10 causes the prediction to be only 150x slower, but increasing by a factor of 10 yields a similar slowdown of 300x. These numbers seem crazy to me, especially because I've used this algorithm in the past on datasets with ~300,000 examples and found it to be quite fast. Something seems fishy but I can't figure out what.

I'm using R version 3.3.1 on Linux. The e1071 package is up-to-date (2015 release).

The code below should be reproducible on any machine. FYI my machine timed the Iris classification at 0.003s, the Iris prediction at 0.032s, the simulated data classification at 0.045s, and the resulting prediction at 15.205s. If you get different numbers than these, please let me know as it could be some issue on my local machine.

# Remove everything from the environment and clear out memory
rm(list = ls())
gc()

# Load required packages and datasets
require(e1071)
data(iris)

# Custom function: tic/toc function to time the execution
tic <- function(gcFirst = TRUE, type=c("elapsed", "user.self", "sys.self"))
{
  type <- match.arg(type)
  assign(".type", type, envir=baseenv())
  if(gcFirst) gc(FALSE)
  tic <- proc.time()[type]         
  assign(".tic", tic, envir=baseenv())
  invisible(tic)
}

toc <- function()
{
  type <- get(".type", envir=baseenv())
  toc <- proc.time()[type]
  tic <- get(".tic", envir=baseenv())
  print(toc - tic)
  invisible(toc)
}

# set seed for reproducibility
set.seed(12345)

#---------------------------------
# 1. Naive Bayes on Iris data
#---------------------------------
tic()
model.nb.iris <- naiveBayes(Species~Sepal.Length+Sepal.Width+Petal.Length+Petal.Width,data=iris)
toc()
tic()
pred.nb.iris <- predict(model.nb.iris, iris, type="raw")
toc()

#---------------------------------
# 2. Simulate data and reproduce NB error
#---------------------------------
# Hyperparameters
L <- 5   # no. of locations
N <- 1e4*L

# Data
married        <- 1*(runif(N,0.0,1.0)>.45)
kids           <- 1*(runif(N,0.0,1.0)<.22)
birthloc       <- sample(1:L,N,TRUE)
major          <- 1*(runif(N,0.0,1.0)>.4)
exper          <- 15+4*rnorm(N)
exper[exper<0] <- 0
migShifter     <- 2*runif(N,0.0,1.0)-1
occShifter     <- 2*runif(N,0.0,1.0)-1
X <- data.frame(rep.int(1,N),birthloc,migShifter,occShifter,major,married,kids,exper,exper^2,exper^3)
colnames(X)[1] <- "constant"
rm(married)
rm(kids)
rm(birthloc)
rm(major)
rm(exper)
rm(occShifter)

# Parameters and errors
Gamma <- 15*matrix(runif(7*L), nrow=7, ncol=L)
eps <- matrix(rnorm(N*L, 0, 1), nrow=N, ncol=L)

# Deterministic portion of probabilities
u <- matrix(rep.int(0,N*L), nrow=N, ncol=L)
for (l in 1:L) {
    u[ ,l] = (X$birthloc==l)*Gamma[1,l] +
    X$major*Gamma[2,l]         + X$married*Gamma[3,l]              
    X$kids*Gamma[4,l]          + X$exper*Gamma[5,l]              
    X$occShifter*Gamma[6,l]    + X$migShifter*X$married*Gamma[7,l]
    eps[ ,l]
}

choice <- apply(u, 1, which.max)

# Add choice to data frame
dat <- cbind(choice,X)

# factorize categorical variables for estimation
dat$major      <- as.factor(dat$major)
dat$married    <- as.factor(dat$married)
dat$kids       <- as.factor(dat$kids)
dat$birthloc   <- as.factor(dat$birthloc)
dat$choice     <- as.factor(dat$choice)

tic()
model.nb <- naiveBayes(choice~birthloc+major+married+kids+exper+occShifter+migShifter,data=dat,laplace=3)
toc()
tic()
pred.nb <- predict(model.nb, dat, type="raw")
toc()

Solution

  • I ran into the same problem. I needed to run naive bayes and predict a lot of times (1000's of times) on some big matrices (10000 rows, 1000-2000 cols). Since I had some time, I decided to implement my own implementation of naive bayes to make it a little faster:

    https://cran.r-project.org/web/packages/fastNaiveBayes/index.html

    I made some work out of this and created a package out of it: https://cran.r-project.org/web/packages/fastNaiveBayes/index.html. It is now around 330 times faster using a Bernoulli event model. Moreover, it implements a multinomial event model (even a bit faster) and a Gaussian model (slightly faster). Finally, a mixed model where it's possible to use different event models for different columns and combine them!

    The reason e1071 is so slow in the predict function, is cause they use essentially a double for loop. There was already a pull request open from around beginning 2017 that at least vectorized one of these, but was not accepted yet.