
Good Example to model SVM String Kernel in Caret?


I am trying to fit an SVM with a string kernel using caret.

Using Datasets:

library(caret)
library(mlbench)
library(dplyr)
data("HouseVotes84")
dummy_data_classif <- HouseVotes84[,2:length(colnames(HouseVotes84))] %>% 
  mutate_if(is.factor, as.numeric)
dummy_data_classif <- data.frame(cbind(Class=HouseVotes84[,1], dummy_data_classif))
dummy_data_classif[is.na(dummy_data_classif)] <- 0
dummy_data_classif <- as.matrix(dummy_data_classif)
dummy_y_classif <- as.matrix(dummy_data_classif[,which(colnames(dummy_data_classif) == "Class")])
colnames(dummy_y_classif) <- "Class"
dummy_x_classif <- dummy_data_classif[,-which(colnames(dummy_data_classif) == "Class")]

data("cars") #available from caret
dummy_data_regr <- cars
dummy_data_regr <- dummy_data_regr %>%
mutate_if(is.numeric, as.character)
dummy_data_regr <- dummy_data_regr %>%
mutate_if(is.integer, as.character)
dummy_data_regr <- as.matrix(dummy_data_regr)
dummy_y_regr <- as.matrix(dummy_data_regr[,which(colnames(dummy_data_regr) == "Price")])
colnames(dummy_y_regr) <- "Price"
dummy_x_regr <- dummy_data_regr[,-which(colnames(dummy_data_regr) == "Price")]

Using Resampling

resampling <- trainControl(method = "cv",
                               number = 5,
                               allowParallel = FALSE) 

I tried to test these with 3 Methods: svmBoundrangeString, svmExpoString, svmSpectrumString

test_method <- c("svmBoundrangeString", "svmExpoString", "svmSpectrumString")
model_reg <- caret::train(x=dummy_x_regr,
                      y=dummy_y_regr, 
                      data = dummy_data, 
                      method = test_method[1], 
                      trControl = resampling)

model_cls <- caret::train(x=dummy_x_classif,
                      y=dummy_y_classif, 
                      data = dummy_data, 
                      method = test_method[1], 
                      trControl = resampling)

But this does not work. With each of these methods, all the resampled metrics come back missing:

Something is wrong; all the Accuracy metric values are missing

 Accuracy       Kappa    
 Min.   : NA   Min.   : NA  
 1st Qu.: NA   1st Qu.: NA  
 Median : NA   Median : NA  
 Mean   :NaN   Mean   :NaN  
 3rd Qu.: NA   3rd Qu.: NA  
 Max.   : NA   Max.   : NA  
 NA's   :9     NA's   :9  

What can I do to make this work? Or do these methods need the data in a specific format?


Solution

  • These three methods are all based on string kernels. I am not sure how they would be used for regression, but for classification the independent variable is text. With kernlab you provide the text as a list of strings (the package vignette covers this too):

    library(kernlab)
    data(reuters)
    
    head(reuters[1:2])
    [[1]]
    [1] "Computer Terminal Systems Inc said \nit has completed the sale of 200,000 shares of its common \nstock, and warrants to acquire an additional one mln shares, to \n<Sedio N.V.> of Lugano, Switzerland for 50,000 dlrs. \n    The company said the warrants are exercisable for five \nyears at a purchase price of .125 dlrs per share. \n    Computer Terminal said Sedio also has the right to buy \nadditional shares and increase its total holdings up to 40 pct \nof the Computer Terminal's outstanding common stock under \ncertain circumstances involving change of control at the \ncompany. \n    The company said if the conditions occur the warrants would \nbe exercisable at a price equal to 75 pct of its common stock's \nmarket price at the time, not to exceed 1.50 dlrs per share. \n    Computer Terminal also said it sold the technolgy rights to \nits Dot Matrix impact technology, including any future \nimprovements, to <Woodco Inc> of Houston, Tex. for 200,000 \ndlrs. But, it said it would continue to be the exclusive \nworldwide licensee of the technology for Woodco. \n    The company said the moves were part of its reorganization \nplan and would help pay current operation costs and ensure \nproduct delivery. \n    Computer Terminal makes computer generated labels, forms, \ntags and ticket printers and terminals. \n Reuter"
    
    [[2]]
    [1] "Ohio Mattress Co said its first \nquarter, ending February 28, profits may be below the 2.4 mln \ndlrs, or 15 cts a share, earned in the first quarter of fiscal \n1986. \n    The company said any decline would be due to expenses \nrelated to the acquisitions in the middle of the current \nquarter of seven licensees of Sealy Inc, as well as 82 pct of \nthe outstanding capital stock of Sealy. \n    Because of these acquisitions, it said, first quarter sales \nwill be substantially higher than last year's 67.1 mln dlrs. \n    Noting that it typically reports first quarter results in \nlate march, said the report is likely to be issued in early \nApril this year. \n    It said the delay is due to administrative considerations, \nincluding conducting appraisals, in connection with the \nacquisitions. \n Reuter"
    
    str(rlabels)
    Factor w/ 2 levels "acq","crude": 1 1 1 1 1 1 1 1 1 1 ...
    
    mdl <- ksvm(reuters,rlabels,kernel="stringdot",kpar=list(length=5,type = "boundrange"),C=3)
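
To get a feel for what the string kernel measures before fitting anything, here is a minimal sketch that builds a kernlab `stringdot` kernel and evaluates it on a few short strings. The toy strings and the choice of `type = "spectrum"` with `length = 2` are assumptions made for illustration, not values from the question.

```r
library(kernlab)

# Spectrum kernel: compares strings by the substrings of the given length they share.
sk <- stringdot(type = "spectrum", length = 2, normalized = TRUE)

# Toy documents (hypothetical); the first two share more 2-grams than the third.
docs <- list("abcab", "abcba", "xyzxy")

# Kernel value for a single pair of strings.
sk(docs[[1]], docs[[2]])

# Full pairwise kernel matrix; this is what the SVM works with internally.
K <- kernelMatrix(sk, docs)
K
```

Strings with more substrings in common get larger kernel values, which is why these methods need actual text rather than numeric columns.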
    

    Now if you use caret for this, you can see how the model is called with getModelInfo("svmBoundrangeString"). Essentially, you provide the independent variable as a one-column matrix with a column name (I used cbind below):

    library(caret)
    # x must be a one-column character matrix with a column name
    mdl = train(x=cbind(reuters=reuters),y=rlabels,
    method="svmBoundrangeString",trControl=trainControl(method="cv"))
    
    Support Vector Machines with Boundrange String Kernel 
    
    40 samples
     1 predictor
     2 classes: 'acq', 'crude' 
    
    No pre-processing
    Resampling: Cross-Validated (10 fold) 
    Summary of sample sizes: 36, 36, 36, 36, 36, 36, ... 
    Resampling results across tuning parameters:
    
      length  C     Accuracy  Kappa
      2       0.25  0.775     0.55 
      2       0.50  0.775     0.55 
      2       1.00  0.775     0.55 
      3       0.25  0.800     0.60 
      3       0.50  0.800     0.60 
      3       1.00  0.800     0.60 
      4       0.25  0.825     0.65 
      4       0.50  0.825     0.65 
      4       1.00  0.825     0.65 
    
    Accuracy was used to select the optimal model using the largest value.
    The final values used for the model were length = 4 and C = 0.25.
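
For data that does not start out as text, such as the vote table in the question, one possible approach is to collapse each row into a single string and pass the result as a one-column character matrix. The sketch below uses a small made-up data frame (`votes`, with `"u"` standing in for missing votes); these names and the encoding are assumptions for illustration. Whether a string kernel over concatenated votes makes a sensible model is a separate question.

```r
# Toy stand-in for a table of categorical votes (hypothetical data).
votes <- data.frame(
  V1 = c("y", "n", NA, "y"),
  V2 = c("n", "n", "y", NA),
  V3 = c("y", "y", "n", "n"),
  stringsAsFactors = FALSE
)

# Encode missing votes as "u" so every row yields a string of equal length.
votes[is.na(votes)] <- "u"

# Collapse each row into a single string, one character per column.
vote_strings <- apply(votes, 1, paste, collapse = "")

# Shape the strings the way the caret call above does:
# a one-column character matrix with a column name.
x_text <- cbind(votes = vote_strings)
dim(x_text)
```

`x_text` would then take the place of `cbind(reuters=reuters)` in the `train()` call, with a factor of class labels as `y`.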