
Predict the class variable using naiveBayes


I just tried to use the naiveBayes function from the e1071 package. Here is the process:

>library(e1071)
>data(iris)
>head(iris, n=5)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
>model <-naiveBayes(Species~., data = iris)
> pred <- predict(model, newdata = iris, type = 'raw')
> head(pred, n=5)
         setosa   versicolor    virginica
[1,]      1.00000 2.981309e-18 2.152373e-25
[2,]      1.00000 3.169312e-17 6.938030e-25
[3,]      1.00000 2.367113e-18 7.240956e-26
[4,]      1.00000 3.069606e-17 8.690636e-25
[5,]      1.00000 1.017337e-18 8.885794e-26

So far, everything is fine. In the next step, I tried to create a new data point and use the naiveBayes model (model) to predict its class variable (Species). I chose one of the training data points:

> test = c(5.1, 3.5, 1.4, 0.2) 
> prob <- predict(model, newdata = test, type=('raw'))

and here is the result:

> prob
        setosa versicolor virginica
[1,] 0.3333333  0.3333333 0.3333333
[2,] 0.3333333  0.3333333 0.3333333
[3,] 0.3333333  0.3333333 0.3333333
[4,] 0.3333333  0.3333333 0.3333333

which is strange. The data point I used as test is the first row of the iris dataset. Based on the actual data, the class variable of this data point is setosa:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa

and the naiveBayes predicted correctly:

             setosa   versicolor    virginica
   [1,]      1.00000 2.981309e-18 2.152373e-25

but when I try to predict the test data point, it returns incorrect results. Why does it return four rows of predictions when I'm asking for the prediction of just one data point? Am I doing something wrong?


Solution

  • You need column names that correspond to your training data's column names. A row taken directly from the training data works:

    test2 = iris[1,1:4]
    
    predict(model, newdata = test2, type=('raw'))
         setosa   versicolor    virginica
    [1,]      1 2.981309e-18 2.152373e-25
    

    "New" test data defined with data.frame

    test1 = data.frame(Sepal.Length = 5.1, Sepal.Width = 3.5, Petal.Length =  1.4, Petal.Width = 0.2)
    
    predict(model, newdata = test1, type=('raw'))
         setosa   versicolor    virginica
    [1,]      1 2.981309e-18 2.152373e-25
    

    If you feed it only one dimension (as a named data frame), it can still predict via Bayes' rule using just that predictor:

    predict(model, newdata = data.frame(Sepal.Width = 3), type=('raw'))
    
            setosa versicolor virginica
    [1,] 0.2014921  0.3519619  0.446546
    

    If you feed it a dimension not found in the training data, you get equally likely classes: with no matching predictor, the posterior falls back to the class priors, which are uniform for iris (50 observations per species). Inputting a longer unnamed vector just gives you more such predictions, one per element — that is why your length-4 vector produced four rows of 1/3.

    predict(model, newdata = 1, type=('raw'))
    
            setosa versicolor virginica
    [1,] 0.3333333  0.3333333 0.3333333
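
Putting it together, the unnamed vector from the question can be fixed by reshaping it into a one-row data frame with the training column names. A minimal sketch using base R (the reshaping via t() and names() is one of several equivalent ways to do this):

```r
library(e1071)
data(iris)
model <- naiveBayes(Species ~ ., data = iris)

test <- c(5.1, 3.5, 1.4, 0.2)

# Transpose the vector into a 1x4 matrix, convert to a data frame,
# and copy the predictor names from the training data.
test_df <- as.data.frame(t(test))
names(test_df) <- names(iris)[1:4]  # Sepal.Length ... Petal.Width

# Now predict() sees one observation with all four predictors.
predict(model, newdata = test_df, type = "raw")
```

With matching column names, the result is a single row that assigns essentially all probability to setosa, just like the prediction on the original training row.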