
Problem when training Naive Bayes model in R


I am using the caret package (I have not had much experience with caret) to train my data with Naive Bayes, as outlined in the R code below. I am having an issue with the inclusion of the sentence column when executing "nb_model", as it produces a series of error messages:

1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in 
predict.NaiveBayes(modelFit, newdata) : 
Not all variable names used in object found in newdata

2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in 
NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) : 

Can you please suggest how to adapt the R code below to overcome this issue?

Dataset used in the R code below

Quick example of what the dataset looks like (10 variables):

  Over arrested at in | Negative | Negative | Neutral | Neutral | Neutral | Negative | Positive | Neutral | Negative
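For reference, a row like the one above can be represented as a one-row data frame along these lines (the column names V1-V10 are the read.csv defaults; this is only an illustrative sketch, not the real data):

```r
# Illustrative sketch of the dataset structure: V1 holds a free-text
# sentence, V2-V10 hold sentiment labels, and V10 is the response.
toy <- data.frame(
  V1  = "Over arrested at in",
  V2  = "Negative", V3 = "Negative", V4 = "Neutral",
  V5  = "Neutral",  V6 = "Neutral",  V7 = "Negative",
  V8  = "Positive", V9 = "Neutral",  V10 = "Negative",
  stringsAsFactors = FALSE
)
str(toy)  # 1 observation of 10 character variables
```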
library(caret)

# Loading dataset
setwd("directory/path")
TrainSet = read.csv("textsent.csv", header = FALSE)

# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]

# Declaring the trainControl function
train_ctrl = trainControl(
  method = "cv", # Specifying cross-validation
  number = 3     # Specifying 3-fold
)

nb_model = train(
  V10 ~ .,       # Specifying the response variable and the feature variables
  method = "nb", # Specifying the model to use
  data = train,
  trControl = train_ctrl
)

# Get the predictions of your model in the test set
predictions = predict(nb_model, newdata = test)

# See the confusion matrix of your model in the test set
confusionMatrix(predictions, test$V10)

Solution

  • The data set is all character data. Within it there is a combination of easily encoded sentiment labels (V2 - V10) and free-text sentences (V1), to which you could apply any amount of feature engineering to generate any number of features.

    To read up on text mining, check out the tm package, its documentation, or blogs like hack-r.com for practical examples. Here's some GitHub code from the linked article.
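    As a minimal taste of the tm workflow (this assumes the tm package is installed; the sentences here are invented for illustration, not taken from the dataset):

```r
library(tm)

# Invented sentences standing in for the V1 column
sentences <- c("I love london in the summer",
               "The train was delayed again",
               "Great food and friendly people")

# Build a corpus, normalise it, and create a document-term matrix
corpus <- VCorpus(VectorSource(sentences))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

dtm <- DocumentTermMatrix(corpus)
inspect(dtm)  # each remaining term becomes a candidate feature column
```

    Each column of the document-term matrix could then be bound onto the training data as a numeric feature.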

    OK, so first I set stringsAsFactors = F because your V1 has tons of unique sentences. (Since R 4.0.0, stringsAsFactors = FALSE is the default anyway.)

    TrainSet <- read.csv(url("https://raw.githubusercontent.com/jcool12/dataset/master/textsentiment.csv?token=AA4LAP5VXI6I7FRKMT6HDPK6U5XBY"),
                         header = F,
                         stringsAsFactors = F)
    
    library(caret)
    

    Then I did feature engineering

    ## Feature Engineering
    # V2 - V10
    TrainSet[TrainSet=="Negative"] <- 0
    TrainSet[TrainSet=="Positive"] <- 1
    
    # V1 - not sure what you wanted to do with this
    #     but here's a simple example of what 
    #     you could do
    TrainSet$V1 <- grepl("london", TrainSet$V1) # TRUE if "london" appears in the sentence (note: case-sensitive)
    

    Then it worked, though you'll want to refine the engineering of V1 (or drop it) to get better results.

    # In reality you could probably generate 20+ decent features from this text
    #  word count, tons of stuff... see the tm package
    
    # Specifying an 80-20 train-test split
    # Creating the training and testing sets
    train = TrainSet[1:1200, ]
    test = TrainSet[1201:1500, ]
    
    # Declaring the trainControl function
    train_ctrl = trainControl(
      method = "cv", # Specifying cross-validation
      number = 3     # Specifying 3-fold
    )
    
    nb_model = train(
      V10 ~ .,       # Specifying the response variable and the feature variables
      method = "nb", # Specifying the model to use
      data = train,
      trControl = train_ctrl
    )
    
    # Resampling: Cross-Validated (3 fold) 
    # Summary of sample sizes: 799, 800, 801 
    # Resampling results across tuning parameters:
    #   
    #   usekernel  Accuracy   Kappa
    #   FALSE      0.6533444  0.4422346
    #   TRUE       0.6633569  0.4185751
    

    You'll get a few ignorable warnings with this basic example simply because so few sentences in V1 contain the word "london". I would suggest using that column for things like sentiment analysis, term frequency / inverse document frequency, and so on.
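    For instance, two of the simplest extra features mentioned above (word count and character count) could be sketched in base R like this; the sentences and variable names are invented for illustration:

```r
# Invented sentences standing in for V1
v1 <- c("Over arrested at in",
        "I love london in the summer")

# Word count: split each sentence on whitespace and count the pieces
word_count <- lengths(strsplit(trimws(v1), "\\s+"))  # 4 6

# Character count of each sentence
char_count <- nchar(v1)

# These could be bound onto the training data as numeric feature columns
features <- data.frame(word_count, char_count)
```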