I have a table of ticket information. One column is the ticket number, three more columns are free-form text fields containing multiple words in English, and the last column (categorical) is the group the ticket is assigned to.
For simplicity I've put Text### as the cell value, but in reality each of the Field1, Field2 and Field3 columns contains multiple sentences in English.
The data is as follows. The same table contains rows whose correct group is already identified, plus some tickets that are still pending assignment to their corresponding group.
TicketID | Field1 | Field2 | Field3 | Group | DataOneY |
---|---|---|---|---|---|
00000001 | Text101 | Text102 | Text103 | B | B |
00000002 | Text101 | Text102 | Text103 | A | A |
00000003 | Text101 | Text102 | Text103 | B | B |
00000004 | Text101 | Text102 | Text103 | B | B |
00000005 | Text101 | Text102 | Text103 | C | C |
........ | ....... | ....... | ....... | ...... | ........ |
00000789 | Text101 | Text102 | Text103 | | |
00001232 | Text101 | Text102 | Text103 | | |
00012988 | Text101 | Text102 | Text103 | | |
........ | ....... | ....... | ....... | ...... | ........ |
The task at hand is to use an SVM, trained on the previous data, to predict the Group assignment from the words in all of the free-form text fields.
So I build the VCorpus and DTM, and then start building my training, test and prediction dataframes.
The `tSparse` dataframe looks like this (the ticket ID is used as the row name):
TicketID | Word1 | Word2 | Word3 | ..... | WordN | Group | DataOneY |
---|---|---|---|---|---|---|---|
00000001 | 0 | 1 | 0 | ..... | 2 | B | B |
00000001 | 1 | 1 | 3 | ..... | 0 | B | B |
00000002 | 0 | 1 | 0 | ..... | 1 | B | B |
00000103 | 2 | 3 | 3 | ..... | 0 | B | B |
00000084 | 0 | 1 | 0 | ..... | 0 | B | B |
.... | ... | ... | ... | ..... | ... | ... | ... |
00001249 | 0 | 1 | 0 | ..... | 2 | | |
00023232 | 0 | 2 | 2 | ..... | 1 | | |
00000098 | 4 | 1 | 0 | ..... | 1 | | |
.... | ... | ... | ... | ..... | ... | ... | ... |
buildDocCorpus <- reactive({
  # build the VCorpus and DTM
  # Build general dataframe with predictions to split train and test
  tSparse1_r <- tSparse %>% filter(tSparse$dataOneY != "")
  # make sure output column is a factor
  tSparse1_r$dataOneY <- factor(tSparse1_r$dataOneY)
  # Split into training and test dataframes (sets)
  trainSparse <- stratified(tSparse1_r, "dataOneY", .9, keep.rownames = TRUE)
  # make sure trainSparse is a dataframe and use ticket id as index (row names)
  trainSparse <- as.data.frame(trainSparse)
  rownames(trainSparse) <- trainSparse$rn
  trainSparse$rn <- NULL
  # create test dataframe by selecting tickets whose ID doesn't appear in training
  testSparse = subset(tSparse1_r, !(rownames(tSparse1_r) %in% rownames(trainSparse)))
  # build predict set with rows that don't have a group assigned
  PredictSparse1 <- tSparse %>% filter(dataOneY == "" | (is.na(dataOneY)))
  PredictSparse1 <- subset(PredictSparse1, select = -c(dataOneY))
  return(
    list(
      trainSparse = trainSparse,
      testSparse = testSparse,
      PredictSparse = PredictSparse1
    )
  )
})
cfMtxSVM <- function(mymode) {
  #browser()
  mymode = toString(mymode)
  bdc <- buildDocCorpus()
  trainSparse <- bdc$trainSparse
  if (mymode == "test") {
    mySparse <- bdc$testSparse
  }
  else if (mymode == "predict") {
    mySparse <- bdc$PredictSparse
  }
  #subset.test <- test[filt,]
  #rf = randomForest(dataOneY ~ ., data = trainSparse)
  #PredictRF = predict(rf, newdata = mySparse)
  #
  trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
  svm_Linear <- train(dataOneY ~ ., data = trainSparse, method = "svmLinear", trControl = trctrl, preProcess = c("center", "scale"), tuneLength = 10)
  test_svm1 <- predict(svm_Linear, newdata = mySparse)
  #test_svm
  return(
    list(
      testOneY = mySparse$dataOneY,
      test_svm = test_svm1,
      trainSparse = trainSparse
    )
  )
}
When I run the program like this:
tb1 <- cfMtxSVM(mymode = toString("predict"))
I get the following error:
Warning: Error in model.frame.default: factor Group has new level
[No stack trace available]
Of course, the `Group` and `DataOneY` columns are all NA in the predict dataset.
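For reference, here is a minimal sketch of how this kind of error can arise. It is not the real pipeline: the toy data is made up, and `lm` is swapped in for the SVM only because it goes through the same `model.frame.default` check named in the error message. The point it illustrates is that with a formula like `dataOneY ~ .`, the `Group` column becomes a predictor, so any `Group` value in the new data that was not seen in training is rejected.

```r
# Hypothetical toy data standing in for the document-term matrix.
train <- data.frame(
  Word1    = c(0, 1, 2, 3, 1, 0),
  Group    = factor(c("A", "B", "A", "B", "B", "A")),
  dataOneY = c(1.0, 2.1, 0.9, 2.0, 1.2, 1.9)
)

# Rows to predict: Group is an empty string, a value never seen in training.
to_predict <- data.frame(
  Word1 = c(2, 0),
  Group = c("", "")
)

# "dataOneY ~ ." pulls every other column, including Group, in as a predictor.
fit <- lm(dataOneY ~ ., data = train)

# predict() rebuilds the model frame and rejects the unseen Group value.
err <- tryCatch(
  {
    predict(fit, newdata = to_predict)
    NULL
  },
  error = function(e) conditionMessage(e)
)
print(err)
```

If `err` mentions `Group`, the column is being treated as a predictor rather than as a label, which is the first thing worth checking.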
From what I've investigated, it seems I need to assign levels to the Group column in the predict dataset. These are the attempts I've made so far, and every one of them returns an error:
#Attempt 1: Remove both output columns
#PredictSparse1<-subset(PredictSparse1, select = -c(Group,dataOneY))
#Attempt 2: Make PredictSparse Group column a factor
#PredictSparse1$Group<-factor(PredictSparse1$Group)
#Attempt 3: Copy Levels from trainSparse to PredictSparse
#levels(PredictSparse1$Group) <- levels(trainSparse$Group)
#Attempt 4: Like 3 but making it factor
#PredictSparse1$Failure_Mode <- factor(
# PredictSparse1$Failure_Mode,levels = levels(trainSparse$Failure_Mode)
#)
#Attempt 5: Manually specify levels and add NA that is in output column
lvls <- c('A','B','C')
PredictSparse1$Group <- sapply(PredictSparse1$Group, factor, levels=lvls)
PredictSparse1$Group <- addNA(PredictSparse1$Group)
#Attempt 6: Same as 5 but for the three datasets (train, test and predict)
I'm at my wit's end. Could you please shed some light on how to get around the `has new level` error?
If it helps, I also run randomForest on these exact same train, test and predict datasets and it runs OK every time, except when I apply the attempts above to fix the levels error, in which case it breaks as well.
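One way to narrow down a `has new level` error is to list which categorical columns in the new data contain values the training data never saw. The helper below is a rough sketch in base R; `find_new_levels` is a hypothetical name, not part of caret or any other package, and the example data is invented.

```r
# Hypothetical helper: for every categorical column shared by both data
# frames, return the values in `newdata` that do not occur in `train`.
find_new_levels <- function(train, newdata) {
  shared <- intersect(names(train), names(newdata))
  is_cat <- vapply(
    train[shared],
    function(x) is.factor(x) || is.character(x),
    logical(1)
  )
  cat_cols <- shared[is_cat]
  out <- lapply(cat_cols, function(col) {
    x <- train[[col]]
    seen <- if (is.factor(x)) levels(x) else unique(x)
    vals <- unique(as.character(newdata[[col]]))
    # NA is a missing value, not a level, so it is ignored here.
    setdiff(vals[!is.na(vals)], seen)
  })
  names(out) <- cat_cols
  out[lengths(out) > 0]
}

# Invented example: Group is "" in the rows waiting for a prediction.
train_example <- data.frame(
  Word1    = c(0, 1, 2),
  Group    = factor(c("A", "B", "C")),
  dataOneY = factor(c("A", "B", "C"))
)
predict_example <- data.frame(
  Word1 = c(1, 0),
  Group = c("", "")
)

new_levels <- find_new_levels(train_example, predict_example)
print(new_levels)
```

Any column reported here is one the model will see as a predictor at prediction time, so it is worth asking whether it should be in the formula at all.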
Rookie mistake! `dataOneY` and `Group` were copies of each other, so I actually had a data leak in the model.
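A quick check along these lines would have caught the leak earlier. This is only a sketch: `find_target_copies` is a hypothetical helper name, the data is invented, and it only detects exact copies, not columns that are merely highly correlated with the target.

```r
# Hypothetical helper: names of columns whose values are identical to the
# target column once both are compared as character vectors.
find_target_copies <- function(df, target) {
  others <- setdiff(names(df), target)
  same <- vapply(others, function(col) {
    identical(as.character(df[[col]]), as.character(df[[target]]))
  }, logical(1))
  others[same]
}

# Invented example mirroring the layout of trainSparse.
train_example <- data.frame(
  Word1    = c(0, 1, 3, 2),
  Word2    = c(1, 1, 0, 3),
  Group    = factor(c("B", "A", "B", "C")),
  dataOneY = factor(c("B", "A", "B", "C"))
)

leaked <- find_target_copies(train_example, "dataOneY")
print(leaked)
```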
Once I removed `Group` from the training and test datasets and re-ran model training, I was able to get results from the SVM `predict` correctly.
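For anyone hitting the same thing, the column removal can be sketched roughly as below. The list layout matches what `buildDocCorpus()` returns in the question, but `drop_column` is a hypothetical helper and the data frames are invented stand-ins; model training itself is unchanged and not shown.

```r
# Hypothetical helper: remove one column from every data frame in a list,
# keeping row names (the ticket IDs) intact.
drop_column <- function(sets, col) {
  lapply(sets, function(df) df[, setdiff(names(df), col), drop = FALSE])
}

# Invented stand-ins for the three sets returned by buildDocCorpus().
sets <- list(
  trainSparse = data.frame(
    Word1 = c(0, 1), Word2 = c(1, 1),
    Group = c("B", "A"), dataOneY = factor(c("B", "A")),
    row.names = c("00000001", "00000002")
  ),
  testSparse = data.frame(
    Word1 = 2, Word2 = 3,
    Group = "B", dataOneY = factor("B"),
    row.names = "00000103"
  ),
  PredictSparse = data.frame(
    Word1 = c(0, 4), Word2 = c(1, 1),
    Group = c("", ""),
    row.names = c("00001249", "00000098")
  )
)

clean <- drop_column(sets, "Group")

# Every predictor used in training should still exist in the predict set.
predictors <- setdiff(names(clean$trainSparse), "dataOneY")
missing_in_predict <- setdiff(predictors, names(clean$PredictSparse))
print(missing_in_predict)
```

Dropping the column from all three sets before calling `train` keeps the formula `dataOneY ~ .` limited to the word counts.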