R - Why does my SVM function fail when I use a predict dataset but work with the test and train datasets?


I have a table of ticket information. One column is the ticket number, three more columns are free-form text fields containing multiple English words, and a last (categorical) column holds the group the ticket is assigned to.

For simplicity I've just put Text### as the cell value, but in reality each of the Field1, Field2 and Field3 columns contains multiple sentences of English text.

The data is as follows. The same table contains rows with the correct group already identified, and some tickets still pending assignment to their corresponding group.

TicketID  Field1   Field2   Field3   Group   DataOneY
00000001  Text101  Text102  Text103  B       B
00000002  Text101  Text102  Text103  A       A
00000003  Text101  Text102  Text103  B       B
00000004  Text101  Text102  Text103  B       B
00000005  Text101  Text102  Text103  C       C
........  .......  .......  .......  ......  ........
00000789  Text101  Text102  Text103
00001232  Text101  Text102  Text103
00012988  Text101  Text102  Text103
........  .......  .......  .......  ......  ........

The task at hand is to use SVM, based on the previous data, to predict the Group assignment from the words in all of the free-form text fields.

So I build the VCorpus and DTM, and then start building my training, test and prediction dataframes.
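
A minimal sketch of that corpus step, assuming the tm package and a tiny made-up ticket table. The sample sentences, the cleaning steps and the way the three fields are pasted together are illustrative assumptions, not the actual pipeline used here:

```r
library(tm)

# Hypothetical stand-in for the ticket table; the real fields hold full sentences.
tickets <- data.frame(
  TicketID = c("00000001", "00000002", "00000789"),
  Field1   = c("Printer will not start", "Cannot log in to email", "Printer jams on start"),
  Field2   = c("Tried a restart", "Password reset failed", "Paper tray is full"),
  Field3   = c("Office printer", "Email account", "Office printer"),
  Group    = c("B", "A", ""),
  dataOneY = c("B", "A", ""),
  stringsAsFactors = FALSE
)

# One document per ticket: concatenate the three free-form fields.
docs <- paste(tickets$Field1, tickets$Field2, tickets$Field3)

# Build the corpus and apply some common (assumed) cleaning steps.
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Document-term matrix: one row per ticket, one column per word.
dtm <- DocumentTermMatrix(corpus)

# Convert to a data frame, use the ticket ID as row name, and attach the labels.
tSparse <- as.data.frame(as.matrix(dtm))
rownames(tSparse) <- tickets$TicketID
tSparse$Group <- tickets$Group
tSparse$dataOneY <- tickets$dataOneY
```

Each word becomes a numeric column, which is why the formula `dataOneY ~ .` later picks up every remaining column as a predictor.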

The tSparse dataframe looks like this (Ticket ID is used as row name)

          Word1  Word2  Word3  .....  WordN  Group  DataOneY
00000001  0      1      0      .....  2      B      B
00000001  1      1      3      .....  0      B      B
00000002  0      1      0      .....  1      B      B
00000103  2      3      3      .....  0      B      B
00000084  0      1      0      .....  0      B      B
....      ...    ...    ...    .....  ...    ...    ...
00001249  0      1      0      .....  2
00023232  0      2      2      .....  1
00000098  4      1      0      .....  1
....      ...    ...    ...    .....  ...    ...    ...
# Assumes shiny and dplyr are loaded; stratified() is assumed to come from splitstackshape
buildDocCorpus <- reactive({
    # Build the VCorpus and DTM
    # Build general dataframe with predictions to split train and test
    tSparse1_r <- tSparse %>% filter(dataOneY != "")

    # Make sure output column is a factor
    tSparse1_r$dataOneY <- factor(tSparse1_r$dataOneY)
    # Split into training and test dataframes (sets), 90% stratified sample for training
    trainSparse <- stratified(tSparse1_r, "dataOneY", .9, keep.rownames = TRUE)
    # Make sure trainSparse is a dataframe and use ticket id as index (row names)
    trainSparse <- as.data.frame(trainSparse)
    rownames(trainSparse) <- trainSparse$rn
    trainSparse$rn <- NULL
    # Create test dataframe by selecting tickets whose ID doesn't appear in training
    testSparse <- subset(tSparse1_r, !(rownames(tSparse1_r) %in% rownames(trainSparse)))
    # Build predict set with rows that don't have a group assigned
    PredictSparse1 <- tSparse %>% filter(dataOneY == "" | is.na(dataOneY))
    PredictSparse1 <- subset(PredictSparse1, select = -c(dataOneY))
    return(
      list(
        trainSparse = trainSparse,
        testSparse = testSparse,
        PredictSparse = PredictSparse1
      )
    )
  })

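# Assumes caret is loaded; method = "svmLinear" also relies on the kernlab package
cfMtxSVM <- function(mymode){
    #browser()
    mymode <- toString(mymode)
    bdc <- buildDocCorpus()
    trainSparse <- bdc$trainSparse
    if (mymode == "test") {
      mySparse <- bdc$testSparse
    } else if (mymode == "predict") {
      mySparse <- bdc$PredictSparse
    } else {
      # Fail early instead of hitting an undefined mySparse further down
      stop("mymode must be 'test' or 'predict'")
    }

    #subset.test <- test[filt,]
    #rf =randomForest(dataOneY~ ., data=trainSparse)
    #PredictRF = predict(rf,newdata = mySparse)
    #
    # 10-fold cross-validation, repeated 3 times
    trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
    # Linear SVM on every remaining column, with predictors centered and scaled
    svm_Linear <- train(
      dataOneY ~ ., data = trainSparse, method = "svmLinear",
      trControl = trctrl, preProcess = c("center", "scale"), tuneLength = 10
    )
    test_svm1 <- predict(svm_Linear, newdata = mySparse)

    #test_svm
    return(
      list(
        # NULL in "predict" mode, because dataOneY was dropped from that set
        testOneY = mySparse$dataOneY,
        test_svm = test_svm1,
        trainSparse = trainSparse
      )
    )
  }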

When I run the program like this:

tb1 <- cfMtxSVM(mymode =  toString("predict"))

I get the following error:

Warning: Error in model.frame.default: factor Group has new level 

[No stack trace available]

Of course, the Group and DataOneY columns are empty (NA) in every row of the predict dataset.
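
The same message can be reproduced in isolation with base R. The sketch below is a hedged illustration on made-up toy data, assuming that the prediction step rebuilds the model frame against the factor levels recorded at training time (as `model.frame` does when given `xlev`); it is not the actual caret call path:

```r
# Toy training data; Group is deliberately left in as a predictor,
# mirroring the dataOneY ~ . formula used for training.
train <- data.frame(
  Word1    = c(0, 1, 2, 0, 3, 1),
  Group    = factor(c("A", "B", "C", "A", "B", "C")),
  dataOneY = factor(c("A", "B", "C", "A", "B", "C"))
)

# Predictor terms and the factor levels seen during training.
tt   <- delete.response(terms(dataOneY ~ ., data = train))
xlev <- .getXlevels(tt, model.frame(tt, train))

# Unassigned tickets: Group is an empty string, a level never seen in training.
newdata <- data.frame(Word1 = c(1, 0), Group = factor(c("", "")))

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    model.frame(tt, newdata, xlev = xlev)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

The level name printed at the end of the message is the empty string, which is why the error appears to end right after "has new level".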

From what I've found, it seems I need to assign levels to the Group column in the predict dataset. These are all the attempts I've made, and every one of them returns an error:

#Attempt 1: Remove both output columns
#PredictSparse1<-subset(PredictSparse1, select = -c(Group,dataOneY))

#Attempt 2: Make PredictSparse Group column a factor
#PredictSparse1$Group<-factor(PredictSparse1$Group)

#Attempt 3: Copy Levels from trainSparse to PredictSparse
#levels(PredictSparse1$Group) <- levels(trainSparse$Group)

#Attempt 4: Like 3 but making it factor
#PredictSparse1$Failure_Mode <- factor(
#  PredictSparse1$Failure_Mode,levels = levels(trainSparse$Failure_Mode)
#)

#Attempt 5: Manually specify levels and add NA that is in output column
lvls <- c('A','B','C')
PredictSparse1$Group <-  sapply(PredictSparse1$Group, factor, levels=lvls)
PredictSparse1$Group <- addNA(PredictSparse1$Group)

#Attempt 6: Same as 5 but for the three datasets (train, test and predict)

I'm at my wit's end. Could you please shed some light on how to get around the "has new level" error?

If it helps, I also run RandomForest on these exact same train, test and predict datasets, and it runs OK every time. It only breaks when I apply the attempts above to fix the levels error.


Solution

  • Rookie mistake!

    dataOneY and Group were copies of each other, so the model actually had a data leak: with the formula dataOneY ~ ., the label was also being used as a predictor.

    Once I removed Group from the training and test datasets and re-ran model training, SVM predict returned results correctly.
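
    A minimal sketch of that fix on made-up toy data, using base R only. The toy values and the `labelled`, `is_copy` and `predictors` helper names are illustrative assumptions; the commented-out caret calls at the end repeat the model call from the question:

```r
# Toy stand-in for tSparse; Group duplicates dataOneY, as in the question.
tSparse <- data.frame(
  Word1    = c(0, 1, 0, 2, 0, 4),
  Word2    = c(1, 1, 3, 0, 2, 1),
  Group    = c("B", "A", "B", "C", "", ""),
  dataOneY = c("B", "A", "B", "C", "", ""),
  row.names = c("00000001", "00000002", "00000003",
                "00000005", "00000789", "00001232"),
  stringsAsFactors = FALSE
)

# Rows that already have a group assigned.
labelled <- !is.na(tSparse$dataOneY) & tSparse$dataOneY != ""

# Confirm the leak: on labelled rows, Group is an exact copy of the label.
is_copy <- identical(tSparse$Group[labelled], tSparse$dataOneY[labelled])

# Drop the duplicate column once, before any split, so no dataset carries it.
tSparse$Group <- NULL

trainSparse <- tSparse[labelled, , drop = FALSE]
trainSparse$dataOneY <- factor(trainSparse$dataOneY)

# Predict set: unlabelled rows, without the (empty) output column.
predictSparse <- tSparse[!labelled, setdiff(names(tSparse), "dataOneY"), drop = FALSE]

# Sanity check: the predict set has exactly the predictors used in training.
predictors <- setdiff(names(trainSparse), "dataOneY")
stopifnot(identical(predictors, names(predictSparse)))

# The model call from the question can then be used as before, e.g.:
# svm_Linear <- train(dataOneY ~ ., data = trainSparse, method = "svmLinear",
#                     trControl = trctrl, preProcess = c("center", "scale"),
#                     tuneLength = 10)
# predict(svm_Linear, newdata = predictSparse)
```

    Dropping the column before splitting keeps the train, test and predict sets consistent, so only the word-count columns remain as predictors.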