Error in if (any(co)) { : missing value where TRUE/FALSE needed


I have been training several models, and when I try to fit a support vector machine with a radial basis function kernel I get the following error:

> svmRFit <- train(x = Fraud_trainX, 
+                  y = Fraud_trainY, 
+                  method = "svmRadial",
+                  metric = "ROC",
+                  preProc = c("center", "scale"),
+                  tuneLength = 15,
+                  trControl = ctrl)
Error in if (any(co)) { : missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In FUN(newX[, i], ...) : NAs introduced by coercion
2: In FUN(newX[, i], ...) : NAs introduced by coercion
3: In FUN(newX[, i], ...) : NAs introduced by coercion
4: In FUN(newX[, i], ...) : NAs introduced by coercion
5: In FUN(newX[, i], ...) : NAs introduced by coercion
Called from: .local(x, ...)
Browse[1]>

Here is a summary of my data set:

summary(Fraud_trainX)
        Make      AccidentArea                PolicyType   VehicleCategory
 Pontiac  :1412   Rural: 597   SedC                :2109   Sedan  :3660   
 Toyota   :1177   Urban:5186   SedL                :1857   Sport  :1994   
 Honda    :1054                SedA                :1551   Utility: 129   
 Mazda    : 883                SpoC                : 126                  
 Chevrolet: 637                Utility - All Perils: 113                  
 Accura   : 183                UtiCL               :  16                  
 (Other)  : 437                (Other)             :  11                  
 BasePolicy WeekOfMonthClaimed      Age         PolicyNumber     RepNumber     
 AP:1675    Min.   :1.000      Min.   :16.00   Min.   :    2   Min.   : 1.000  
 C :2246    1st Qu.:2.000      1st Qu.:31.00   1st Qu.: 3866   1st Qu.: 4.000  
 L :1862    Median :3.000      Median :38.00   Median : 7757   Median : 9.000  
            Mean   :2.703      Mean   :40.71   Mean   : 7754   Mean   : 8.473  
            3rd Qu.:4.000      3rd Qu.:49.00   3rd Qu.:11556   3rd Qu.:12.000  
            Max.   :5.000      Max.   :80.00   Max.   :15420   Max.   :16.000  
                               NA's   :130                                     
   Deductible     DriverRating     ClaimSize          Month       
 Min.   :400.0   Min.   :1.000   Min.   :     0   Min.   : 1.000  
 1st Qu.:400.0   1st Qu.:1.000   1st Qu.:  4112   1st Qu.: 3.000  
 Median :400.0   Median :3.000   Median :  8150   Median : 6.000  
 Mean   :407.3   Mean   :2.488   Mean   : 22921   Mean   : 6.384  
 3rd Qu.:400.0   3rd Qu.:3.000   3rd Qu.: 43446   3rd Qu.: 9.000  
 Max.   :700.0   Max.   :4.000   Max.   :141394   Max.   :12.000  
                 NA's   :4                                        
  WeekOfMonth      DayOfWeek     DayOfWeekClaimed  MonthClaimed   
 Min.   :1.000   Min.   :1.000   Min.   :1.000    Min.   : 1.000  
 1st Qu.:2.000   1st Qu.:2.000   1st Qu.:2.000    1st Qu.: 3.000  
 Median :3.000   Median :4.000   Median :3.000    Median : 6.000  
 Mean   :2.776   Mean   :3.844   Mean   :2.824    Mean   : 6.345  
 3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000    3rd Qu.: 9.000  
 Max.   :5.000   Max.   :7.000   Max.   :7.000    Max.   :12.000  
                                                                  
      Sex         MaritalStatus       Fault         VehiclePrice  
 Min.   :0.0000   Min.   :1.000   Min.   :0.0000   Min.   :1.000  
 1st Qu.:1.0000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
 Median :1.0000   Median :2.000   Median :0.0000   Median :2.000  
 Mean   :0.8406   Mean   :1.698   Mean   :0.2722   Mean   :2.783  
 3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
 Max.   :1.0000   Max.   :3.000   Max.   :1.0000   Max.   :6.000  
                                                                  
 Days_Policy_Accident Days_Policy_Claim PastNumberOfClaims  AgeOfVehicle  
 Min.   :0.000        Min.   :1.000     Min.   :0.000      Min.   :0.000  
 1st Qu.:4.000        1st Qu.:3.000     1st Qu.:0.000      1st Qu.:6.000  
 Median :4.000        Median :3.000     Median :1.000      Median :7.000  
 Mean   :3.971        Mean   :2.993     Mean   :1.333      Mean   :6.592  
 3rd Qu.:4.000        3rd Qu.:3.000     3rd Qu.:2.000      3rd Qu.:8.000  
 Max.   :4.000        Max.   :3.000     Max.   :3.000      Max.   :8.000  
                                                                          
 AgeOfPolicyHolder PoliceReportFiled WitnessPresent      AgentType      
 Min.   :1.00      Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:5.00      1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :6.00      Median :0.00000   Median :0.00000   Median :0.00000  
 Mean   :5.89      Mean   :0.02957   Mean   :0.00536   Mean   :0.01504  
 3rd Qu.:7.00      3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :9.00      Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
                                                                        
 NumberOfSuppliments AddressChange_Claim  NumberOfCars   
 Min.   :0.000       Min.   :0.0000      Min.   :0.0000  
 1st Qu.:0.000       1st Qu.:0.0000      1st Qu.:0.0000  
 Median :1.000       Median :0.0000      Median :0.0000  
 Mean   :1.163       Mean   :0.1757      Mean   :0.1027  
 3rd Qu.:2.000       3rd Qu.:0.0000      3rd Qu.:0.0000  
 Max.   :3.000       Max.   :3.0000      Max.   :3.0000 

The structure of the data set:

str(Fraud_trainX)
'data.frame':   5783 obs. of  32 variables:
 $ Make                : Factor w/ 19 levels "Accura","BMW",..: 7 18 6 7 6 6 6 3 10 7 ...
 $ AccidentArea        : Factor w/ 2 levels "Rural","Urban": 2 1 2 1 2 2 2 2 2 2 ...
 $ PolicyType          : Factor w/ 8 levels "SedA","SedC",..: 5 3 3 2 3 3 1 2 3 2 ...
 $ VehicleCategory     : Factor w/ 3 levels "Sedan","Sport",..: 2 2 2 1 2 2 1 1 2 1 ...
 $ BasePolicy          : Factor w/ 3 levels "AP","C","L": 2 3 3 2 3 3 1 2 3 2 ...
 $ WeekOfMonthClaimed  : num  4 1 3 1 1 5 1 1 1 4 ...
 $ Age                 : num  34 65 28 NA 61 38 41 28 40 21 ...
 $ PolicyNumber        : num  2 4 13 14 15 16 17 18 21 27 ...
 $ RepNumber           : num  15 4 11 12 3 16 15 6 3 1 ...
 $ Deductible          : num  400 400 400 400 400 400 400 400 400 400 ...
 $ DriverRating        : num  4 2 1 3 1 1 4 1 1 2 ...
 $ ClaimSize           : num  59294 7584 59748 82212 59552 ...
 $ Month               : int  1 6 1 1 1 8 4 7 4 3 ...
 $ WeekOfMonth         : int  3 2 3 5 5 4 4 5 2 3 ...
 $ DayOfWeek           : int  3 6 5 5 1 2 4 7 5 4 ...
 $ DayOfWeekClaimed    : int  1 5 5 3 4 1 3 3 2 4 ...
 $ MonthClaimed        : int  1 7 1 2 2 8 5 8 5 6 ...
 $ Sex                 : int  1 1 1 1 1 1 1 0 1 1 ...
 $ MaritalStatus       : int  1 2 2 1 2 1 2 2 2 2 ...
 $ Fault               : int  0 1 0 1 0 0 0 1 0 0 ...
 $ VehiclePrice        : int  6 2 6 6 6 6 6 2 2 3 ...
 $ Days_Policy_Accident: int  4 4 4 4 4 4 4 4 4 4 ...
 $ Days_Policy_Claim   : int  3 3 3 3 3 3 3 3 3 3 ...
 $ PastNumberOfClaims  : int  0 1 1 0 0 0 0 0 1 3 ...
 $ AgeOfVehicle        : int  6 8 7 0 8 6 7 7 8 5 ...
 $ AgeOfPolicyHolder   : int  5 8 5 1 8 6 6 5 6 4 ...
 $ PoliceReportFiled   : int  1 1 0 0 0 0 0 0 0 0 ...
 $ WitnessPresent      : int  0 0 0 0 0 0 0 0 0 0 ...
 $ AgentType           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ NumberOfSuppliments : int  0 3 0 0 0 0 0 1 3 3 ...
 $ AddressChange_Claim : int  0 0 0 0 0 0 0 0 0 0 ...
 $ NumberOfCars        : int  0 0 0 0 0 0 0 0 0 0 ...

The response variable:

summary(Fraud_trainY)
  No  Yes 
5440  343 

Here are the resampling index and the control object that I use for model training:

indx <- createMultiFolds(Fraud_trainY, k = 5, times = 2)
str(indx)
ctrl <- trainControl(method = "repeatedcv", index = indx,
                     summaryFunction = twoClassSummary,
                     sampling = "up",
                     classProbs = TRUE)

And here is the model call:

svmRFit <- train(x = Fraud_trainX, 
                 y = Fraud_trainY, 
                 method = "svmRadial",
                 metric = "ROC",
                 preProc = c("center", "scale"),
                 tuneLength = 15,
                 trControl = ctrl)

I have already tried loading the pROC library, which did not help. I have removed the rows that contained NA from all the variables, and the response variable already has the levels "No" and "Yes". I have also trained C5.0 ("C5.0"), neural network ("nnet") and logistic regression ("multinom") models on the same data, and all of them fit and return a model; this is the only method that raises an error.


Solution

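  • As @AlvaroMartinez commented, the problem was that some of my predictors were factors; once I converted those variables to integer, the model trained correctly. This fits the five "NAs introduced by coercion" warnings: the data set has five factor columns (Make, AccidentArea, PolicyType, VehicleCategory, BasePolicy).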
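
For anyone hitting the same error, here is a minimal sketch of that conversion in base R. It runs on a small made-up data frame (toy_x and its values are hypothetical stand-ins for Fraud_trainX, not the real data) and is only meant to illustrate the idea: a data frame with factor columns becomes a character matrix under as.matrix(), while one with integer codes becomes numeric.

```r
# Toy stand-in for Fraud_trainX (hypothetical values, same kinds of columns).
toy_x <- data.frame(
  AccidentArea = factor(c("Urban", "Rural", "Urban", "Urban")),
  BasePolicy   = factor(c("C", "L", "L", "AP")),
  Deductible   = c(400, 400, 700, 400)
)

# as.matrix() on a data frame that contains factors gives a character matrix,
# so numeric computations on its columns have to coerce strings to numbers.
is.character(as.matrix(toy_x))

# Find the factor columns and replace each one with its integer level codes.
factor_cols <- vapply(toy_x, is.factor, logical(1))
toy_x_int <- toy_x
toy_x_int[factor_cols] <- lapply(toy_x_int[factor_cols], as.integer)

# With only integer and numeric columns, the matrix is numeric.
is.numeric(as.matrix(toy_x_int))

# Alternative (an assumption, not what the original poster did): expand each
# factor into 0/1 indicator columns and drop the intercept column.
toy_x_dummy <- model.matrix(~ ., data = toy_x)[, -1, drop = FALSE]
```

Note that integer codes follow the alphabetical order of the levels, so for an unordered variable such as Make they impose an arbitrary ordering and spacing; indicator columns avoid that, at the cost of more predictors. Which of the two works better here is not something the original post settles.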