
Error when training GBM with R caret: "Error in { : task 1 failed - arguments imply differing number of rows"


I want to solve a classification problem using GBM. However, when I train it through caret, the following error occurs.

Error in { : task 1 failed - "arguments imply differing number of rows: 0, 336"

For reference, my data contains no NA or NULL values. Here is my data

There is no problem when I use the gbm package directly. If you know why this happens when using caret, please help me.

Below is my code and session info.

if(!require(caret)){install.packages('caret', dep=TRUE);require(caret)}
if(!require(data.table)){install.packages('data.table', dep=TRUE);require(data.table)}
if(!require(gbm)){install.packages('gbm', dep=TRUE);require(gbm)}

trainSet <- fread(file="trainSet.csv")

trainSet$result <- as.factor(trainSet$result)

fitControl <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 5
) 

#Error in { : task 1 failed - "arguments imply differing number of rows: 0, 336"
model_gbm_caret<-train(result~ +size_delta+inserted_line+deleted_line+size, 
                       data = trainSet, 
                       method='gbm', 
                       trControl = fitControl,
                       verbose=TRUE)

#no error
model_gbm<-gbm(result~+size_delta+inserted_line+deleted_line+size, data=trainSet, cv.folds = 2)

session info

(64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949  LC_CTYPE=Korean_Korea.949    LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C
[5] LC_TIME=Korean_Korea.949

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] gbm_2.1.5         data.table_1.12.8 caret_6.0-86      ggplot2_3.3.0     lattice_0.20-40

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4           pillar_1.4.3         compiler_3.5.3       gower_0.2.1          plyr_1.8.6
 [6] iterators_1.0.12     class_7.3-15         tools_3.5.3          rpart_4.1-15         packrat_0.5.0
[11] ipred_0.9-9          lubridate_1.7.4      lifecycle_0.2.0      tibble_2.1.3         nlme_3.1-137
[16] gtable_0.3.0         pkgconfig_2.0.3      rlang_0.4.5          Matrix_1.2-18        foreach_1.5.0
[21] rstudioapi_0.11      parallel_3.5.3       prodlim_2019.11.13   e1071_1.7-3          gridExtra_2.3
[26] stringr_1.4.0        withr_2.1.2          dplyr_0.8.5          pROC_1.16.2          generics_0.0.2
[31] recipes_0.1.10       stats4_3.5.3         nnet_7.3-13          grid_3.5.3           tidyselect_1.0.0
[36] glue_1.3.2           R6_2.4.1             survival_3.1-11      lava_1.6.7           reshape2_1.4.3
[41] purrr_0.3.3          magrittr_1.5         ModelMetrics_1.2.2.2 splines_3.5.3        scales_1.1.0
[46] codetools_0.2-16     MASS_7.3-51.5        rsconnect_0.8.16     assertthat_0.2.1     timeDate_3043.102
[51] colorspace_1.4-1     stringi_1.4.6        munsell_0.5.0        crayon_1.3.4

Appreciate your help!


Solution

  • There are a few issues. If you look at what you are trying to predict, it really doesn't make sense:

    library(gbm)
    library(data.table)
    library(caret)
    
    trainSet <- fread("https://raw.githubusercontent.com/kyrios05/R-Machine-Learning/master/trainSet.csv")
    
    table(trainSet$result)
    
      1   8   9  10  11  14  15  16  17  18  19  20  22  23  24  26  28  30  31  33 
      3   3   3   2  24   3   8   3   4   2  12   5  41   5   3  63   5   3   4   3 
     36  38  39  42  43  44  46  47  48  49  50  51  52  53  54  55  56  57  58  59 
      3   3   2   5   6   2   2   3  28  14   4   3   5   3   3  10   8   2   6   6 
     60  61  62  65  67  70  72  73  74  75  76  77  79  80  81  82  83  85  87  88 
      5   9  10   3   5   4 813 257   6   3   9   9   2   3   3   6   2   5   3   6 
     90  92  93  94  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 
      3   2  20  13   5   3   3   9  42   2   2   3   7   2   2   4   2  13   2   3 
    112 113 114 115 116 117 118 119 
      3  12   3   2   4   5   3   2 
    

    You are trying to run a classification on what looks like discrete numeric values. If I run gbm directly, it runs but throws warnings, because there are too many label classes and too little data:

    trainSet$result = factor(trainSet$result)
    
    model_gbm<-gbm(result~+size_delta+inserted_line+deleted_line+size, data=trainSet, cv.folds = 2)
    Distribution not specified, assuming multinomial ...
    Warning messages:
    1: In predict.gbm(model, newdata = my.data, n.trees = best.iter.cv) :
      NAs introduced by coercion
    2: In predict.gbm(model, newdata = my.data, n.trees = best.iter.cv) :
      NAs introduced by coercion
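
To make the "too many classes, too little data" point concrete, one can count how many labels have fewer rows than the number of CV folds. This is only a minimal sketch in base R: the `result` vector below is a hand-built, hypothetical stand-in that copies six of the counts from the table above, not the real file, and `n_folds` simply mirrors `number = 5` from the question's `trainControl`.

```r
# Hypothetical stand-in for trainSet$result, using six counts from the table above
result <- factor(c(rep(72, 813), rep(73, 257),
                   rep(c(1, 8, 9), each = 3), rep(10, 2)))

class_counts <- table(result)
n_folds <- 5  # mirrors number = 5 in the question's trainControl

# A label with fewer rows than folds cannot be present in every held-out fold,
# and it may be missing entirely from some resampled training sets.
rare <- names(class_counts)[as.vector(class_counts) < n_folds]

print(class_counts)
print(rare)
```

On the real data the same check would be `table(trainSet$result)` followed by the `rare` line; most of the labels would probably land in `rare`.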
    

    If it is indeed classification, you can reduce it to 3 classes:

    trainSet$label = as.character(trainSet$result)
    trainSet$label[!trainSet$label %in% c(72,73)] <- "others"
    
    fitControl <- trainControl(method = "cv",number=2) 
    model_gbm_caret<-train(label~ +size_delta+inserted_line+deleted_line+size, 
                           data = trainSet, 
                           method='gbm', 
                           trControl = fitControl,
                           verbose=TRUE,distribution="multinomial")
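
Before fitting, it can help to confirm that the collapsing step really leaves three reasonably populated classes. The sketch below repeats the relabelling on a small hypothetical vector (again built from a few counts in the table above, not the real file) and converts the result to a factor explicitly, which is a defensive choice on my part rather than something the code above requires.

```r
# Hypothetical stand-in for trainSet$result
result <- c(rep(72, 813), rep(73, 257),
            rep(c(1, 8, 9), each = 3), rep(10, 2))

# Same idea as above: keep 72 and 73, lump everything else together
label <- as.character(result)
label[!label %in% c("72", "73")] <- "others"

# Explicit factor so the outcome is unambiguously categorical
label <- factor(label)

print(table(label))
```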
    

    Or you can run a regression (which I hope is what was intended):

    trainSet <- fread("https://raw.githubusercontent.com/kyrios05/R-Machine-Learning/master/trainSet.csv")
    fitControl <- trainControl(method = "cv",number=2) 
    model_gbm_caret<-train(result ~ +size_delta+inserted_line+deleted_line+size, 
                           data = trainSet, 
                           method='gbm', 
                           trControl = fitControl,
                           verbose=TRUE)
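
The regression call works only because the file is read again, so `result` is numeric rather than the factor created in the question; caret's `train()` chooses between classification and regression from the type of the outcome. If the factor version is already in memory, it has to be converted back with care. A small base R sketch with made-up values (the vector below is hypothetical, not taken from the file):

```r
# Hypothetical outcome values
result_int <- c(72L, 73L, 11L, 26L)
result_fac <- factor(result_int)

is.numeric(result_int)  # numeric outcome: treated as regression
is.factor(result_fac)   # factor outcome: treated as classification

# as.integer() on a factor returns the internal level codes, not the labels
codes <- as.integer(result_fac)

# Going through character recovers the original numbers
restored <- as.numeric(as.character(result_fac))

print(codes)
print(restored)
```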