I want to solve a classification problem using gbm. However, when I use caret, the following error occurs.
Error in { : task 1 failed - "arguments imply differing number of rows: 0, 336"
For reference, there are no NA or NULL values in my data. Here is my data
There is no problem when I use the gbm package directly. If you know why this happens with caret, please help me.
Below is my code and session info.
if(!require(caret)){install.packages('caret', dep=TRUE);require(caret)}
if(!require(data.table)){install.packages('data.table', dep=TRUE);require(data.table)}
if(!require(gbm)){install.packages('gbm', dep=TRUE);require(gbm)}
trainSet <- fread(file="trainSet.csv")
trainSet$result <- as.factor(trainSet$result)
fitControl <- trainControl(
method = "repeatedcv",
number = 5,
repeats = 5
)
#Error in { : task 1 failed - "arguments imply differing number of rows: 0, 336"
model_gbm_caret<-train(result~ +size_delta+inserted_line+deleted_line+size,
data = trainSet,
method='gbm',
trControl = fitControl,
verbose=TRUE)
#no error
model_gbm<-gbm(result~+size_delta+inserted_line+deleted_line+size, data=trainSet, cv.folds = 2)
session info
(64-bit)
Running under: Windows Server 2008 R2 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=Korean_Korea.949 LC_CTYPE=Korean_Korea.949 LC_MONETARY=Korean_Korea.949 LC_NUMERIC=C
[5] LC_TIME=Korean_Korea.949

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] gbm_2.1.5 data.table_1.12.8 caret_6.0-86 ggplot2_3.3.0 lattice_0.20-40

loaded via a namespace (and not attached):
[1] Rcpp_1.0.4 pillar_1.4.3 compiler_3.5.3 gower_0.2.1 plyr_1.8.6
[6] iterators_1.0.12 class_7.3-15 tools_3.5.3 rpart_4.1-15 packrat_0.5.0
[11] ipred_0.9-9 lubridate_1.7.4 lifecycle_0.2.0 tibble_2.1.3 nlme_3.1-137
[16] gtable_0.3.0 pkgconfig_2.0.3 rlang_0.4.5 Matrix_1.2-18 foreach_1.5.0
[21] rstudioapi_0.11 parallel_3.5.3 prodlim_2019.11.13 e1071_1.7-3 gridExtra_2.3
[26] stringr_1.4.0 withr_2.1.2 dplyr_0.8.5 pROC_1.16.2 generics_0.0.2
[31] recipes_0.1.10 stats4_3.5.3 nnet_7.3-13 grid_3.5.3 tidyselect_1.0.0
[36] glue_1.3.2 R6_2.4.1 survival_3.1-11 lava_1.6.7 reshape2_1.4.3
[41] purrr_0.3.3 magrittr_1.5 ModelMetrics_1.2.2.2 splines_3.5.3 scales_1.1.0
[46] codetools_0.2-16 MASS_7.3-51.5 rsconnect_0.8.16 assertthat_0.2.1 timeDate_3043.102
[51] colorspace_1.4-1 stringi_1.4.6 munsell_0.5.0 crayon_1.3.4
Appreciate your help!
There are a few issues. If you look at what you are trying to predict, it really doesn't make sense:
library(gbm)
library(data.table)
library(caret)
trainSet <- fread("https://raw.githubusercontent.com/kyrios05/R-Machine-Learning/master/trainSet.csv")
table(trainSet$result)
  1   8   9  10  11  14  15  16  17  18  19  20  22  23  24  26  28  30  31  33
  3   3   3   2  24   3   8   3   4   2  12   5  41   5   3  63   5   3   4   3
 36  38  39  42  43  44  46  47  48  49  50  51  52  53  54  55  56  57  58  59
  3   3   2   5   6   2   2   3  28  14   4   3   5   3   3  10   8   2   6   6
 60  61  62  65  67  70  72  73  74  75  76  77  79  80  81  82  83  85  87  88
  5   9  10   3   5   4 813 257   6   3   9   9   2   3   3   6   2   5   3   6
 90  92  93  94  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111
  3   2  20  13   5   3   3   9  42   2   2   3   7   2   2   4   2  13   2   3
112 113 114 115 116 117 118 119
  3  12   3   2   4   5   3   2
You are trying to run a classification on what looks like a discrete numeric variable with 88 distinct values, most of which occur only a handful of times. If I run gbm directly, it does run, but it throws warnings because there are too many classes and too little data:
trainSet$result = factor(trainSet$result)
model_gbm<-gbm(result~+size_delta+inserted_line+deleted_line+size, data=trainSet, cv.folds = 2)
Distribution not specified, assuming multinomial ...
Warning messages:
1: In predict.gbm(model, newdata = my.data, n.trees = best.iter.cv) :
NAs introduced by coercion
2: In predict.gbm(model, newdata = my.data, n.trees = best.iter.cv) :
NAs introduced by coercion
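To put a number on "too many classes, too little data", here is a minimal sketch that works from the counts printed by table(trainSet$result) above. The counts are typed in by hand so the snippet runs without the CSV; on the real data you would use as.vector(table(trainSet$result)) instead. Comparing against 5 is my own choice, picked only because the question uses number = 5 folds.

```r
# Class counts copied from the table(trainSet$result) output above.
class_counts <- c(
  3, 3, 3, 2, 24, 3, 8, 3, 4, 2, 12, 5, 41, 5, 3, 63, 5, 3, 4, 3,
  3, 3, 2, 5, 6, 2, 2, 3, 28, 14, 4, 3, 5, 3, 3, 10, 8, 2, 6, 6,
  5, 9, 10, 3, 5, 4, 813, 257, 6, 3, 9, 9, 2, 3, 3, 6, 2, 5, 3, 6,
  3, 2, 20, 13, 5, 3, 3, 9, 42, 2, 2, 3, 7, 2, 2, 4, 2, 13, 2, 3,
  3, 12, 3, 2, 4, 5, 3, 2
)

n_classes <- length(class_counts)

# Assumption: 5 matches trainControl(number = 5) in the question.
# A class with fewer rows than folds cannot appear in every fold.
n_folds <- 5
n_rare <- sum(class_counts < n_folds)

# Share of all rows taken by the two largest classes (72 and 73).
top2_share <- sum(sort(class_counts, decreasing = TRUE)[1:2]) / sum(class_counts)

cat("classes:", n_classes, "\n")
cat("classes with fewer rows than folds:", n_rare, "\n")
cat("share of rows in the two largest classes:", round(top2_share, 2), "\n")
```

Classes that are smaller than the number of folds will be missing from some resamples entirely, which is the kind of situation where a multi-class fit becomes unreliable.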
If it is indeed a classification problem, you can reduce it to 3 classes (72, 73, and everything else):
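trainSet$label = as.character(trainSet$result)
trainSet$label[!trainSet$label %in% c("72","73")] <- "others"
# explicit factor so train() treats the outcome as classes
trainSet$label <- factor(trainSet$label)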
fitControl <- trainControl(method = "cv",number=2)
model_gbm_caret<-train(label~ +size_delta+inserted_line+deleted_line+size,
data = trainSet,
method='gbm',
trControl = fitControl,
verbose=TRUE,distribution="multinomial")
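Before fitting, it is worth checking what the collapsed label looks like. The sketch below repeats the same recoding on a small made-up vector (toy_result is hypothetical, not the real data) so that it runs on its own; on the real data you would simply call table(trainSet$label).

```r
# Hypothetical toy outcome, standing in for trainSet$result.
toy_result <- c(72, 73, 26, 72, 11, 73, 72, 100)

# Same recoding as above: keep 72 and 73, lump the rest into "others".
toy_label <- as.character(toy_result)
toy_label[!toy_label %in% c("72", "73")] <- "others"
toy_label <- factor(toy_label, levels = c("72", "73", "others"))

# Class balance after collapsing.
label_counts <- table(toy_label)
print(label_counts)
```

If "others" still dominates or one class is very small, stratified resampling or a different grouping may be needed, but that depends on what the classes mean in your problem.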
Or you can run a regression (which I hope is what was intended):
trainSet <- fread("https://raw.githubusercontent.com/kyrios05/R-Machine-Learning/master/trainSet.csv")
fitControl <- trainControl(method = "cv",number=2)
model_gbm_caret<-train(result ~ +size_delta+inserted_line+deleted_line+size,
data = trainSet,
method='gbm',
trControl = fitControl,
verbose=TRUE)
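The regression version re-reads the file so that result is numeric again rather than the factor created earlier. If you would rather convert an existing factor column back, note that as.numeric() on a factor returns the internal level codes, not the original values. A minimal sketch on a made-up vector (result_factor is hypothetical):

```r
# Hypothetical factor, standing in for a result column that was
# already converted with as.factor().
result_factor <- factor(c(72, 73, 26, 72))

# Wrong for this purpose: gives the integer level codes.
level_codes <- as.numeric(result_factor)

# Goes through the labels, so the original values come back.
recovered <- as.numeric(as.character(result_factor))

print(level_codes)
print(recovered)
```

With a numeric outcome, train() fits a regression model; with a factor outcome, it fits a classifier.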