I am using the caret package (I have not had much experience with caret) to train my data with Naive Bayes, as outlined in the R code below. I am having an issue with the inclusion of sentences when executing "nb_model", as it produces a series of error messages:
1: predictions failed for Fold1: usekernel= TRUE, fL=0, adjust=1 Error in predict.NaiveBayes(modelFit, newdata) :
   Not all variable names used in object found in newdata
2: model fit failed for Fold1: usekernel=FALSE, fL=0, adjust=1 Error in NaiveBayes.default(x, y, usekernel = FALSE, fL = param$fL, ...) :
Could you please suggest how to adapt the R code below to overcome this issue?
Dataset used in the R code below
Quick example of what the dataset looks like (10 variables):
Over arrested at in | Negative | Negative | Neutral | Neutral | Neutral | Negative | Positive | Neutral | Negative
library(caret)
# Loading dataset
setwd("directory/path")
TrainSet = read.csv("textsent.csv", header = FALSE)
# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]
# Declaring the trainControl function
train_ctrl = trainControl(
  method = "cv", # Specifying cross-validation
  number = 3     # Specifying 3-fold
)
nb_model = train(
  V10 ~ .,       # Specifying the response variable and the feature variables
  method = "nb", # Specifying the model to use
  data = train,
  trControl = train_ctrl
)
# Get the predictions of your model in the test set
predictions = predict(nb_model, newdata = test)
# See the confusion matrix of your model in the test set
confusionMatrix(predictions, test$V10)
The data set is all character data. Within that data there is a combination of easily encoded words (V2 - V10) and sentences (V1), to which you could apply any amount of feature engineering and generate any number of features. To read up on text mining, check out the tm package, its docs, or blogs like hack-r.com for practical examples. Here's some GitHub code from the linked article.
OK, so first I set stringsAsFactors = F because your V1 has tons of unique sentences:
TrainSet <- read.csv(url("https://raw.githubusercontent.com/jcool12/dataset/master/textsentiment.csv?token=AA4LAP5VXI6I7FRKMT6HDPK6U5XBY"),
                     header = F,
                     stringsAsFactors = F)
library(caret)
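A quick optional check shows why this matters: V1 is free text with mostly unique values, while V2 - V10 only take a handful of labels (the exact counts will depend on the data).
# Optional sanity check: V1 is free text, V2 - V10 are a few repeated labels
str(TrainSet)
length(unique(TrainSet$V1))   # expect many unique sentences
length(unique(TrainSet$V10))  # expect just the few sentiment labels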
Then I did some feature engineering:
## Feature Engineering
# V2 - V10
TrainSet[TrainSet=="Negative"] <- 0
TrainSet[TrainSet=="Positive"] <- 1
# V1 - not sure what you wanted to do with this
# but here's a simple example of what
# you could do
TrainSet$V1 <- grepl("london", TrainSet$V1) # tests if london is in the string
Then it worked, though you'll want to refine the engineering of V1 (or drop it) to get better results.
# In reality you could probably generate 20+ decent features from this text
# word count, tons of stuff... see the tm package
# (a quick sketch of a few such features follows the results below)
# Specifying an 80-20 train-test split
# Creating the training and testing sets
train = TrainSet[1:1200, ]
test = TrainSet[1201:1500, ]
# Declaring the trainControl function
train_ctrl = trainControl(
  method = "cv", # Specifying cross-validation
  number = 3     # Specifying 3-fold
)
nb_model = train(
  V10 ~ .,       # Specifying the response variable and the feature variables
  method = "nb", # Specifying the model to use
  data = train,
  trControl = train_ctrl
)
# Resampling: Cross-Validated (3 fold)
# Summary of sample sizes: 799, 800, 801
# Resampling results across tuning parameters:
#
# usekernel Accuracy Kappa
# FALSE 0.6533444 0.4422346
# TRUE 0.6633569 0.4185751
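Before stopping at a single grepl() flag, you could also add a few cheap numeric features from the raw sentences. Here is a rough sketch (the column names are just illustrative, and it assumes you keep a copy of the original text before V1 is overwritten with the grepl() flag above):
raw_text <- TrainSet$V1                                       # run this before recoding V1
TrainSet$word_count  <- lengths(strsplit(raw_text, "\\s+"))   # words per sentence
TrainSet$char_count  <- nchar(raw_text)                       # characters per sentence
TrainSet$has_exclaim <- grepl("!", raw_text, fixed = TRUE)    # crude intensity flag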
You'll get a few ignorable warnings with this basic example just because so few sentences in V1 contained the word "london". I would suggest using that column for things like sentiment analysis, term frequency / inverse document frequency (tf-idf), etc.
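If you go the tf-idf route, the tm package can turn the raw sentences into a document-term matrix you can feed to caret. A minimal sketch, again assuming the original sentences were kept as a character vector (raw_text here); the preprocessing steps and the sparsity threshold are just reasonable defaults, not tuned values:
library(tm)
# Build a tf-idf weighted document-term matrix from the raw sentences
corpus <- Corpus(VectorSource(raw_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
dtm <- removeSparseTerms(dtm, 0.99)   # drop very sparse terms to keep the matrix manageable
text_features <- as.data.frame(as.matrix(dtm))
# cbind() these onto the encoded V2 - V10 columns to build a richer training set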