Tags: r, machine-learning, r-caret, xgboost

R xgboost on caret attempts to perform classification instead of regression


Hello, everyone.

First, here is a sample of the data:

> str(train)
'data.frame':   30226 obs. of  71 variables:
 $ sal              : int  2732 2732 2732 2328 2560 3584 5632 5632 3584 2150 ...
 $ avg              : num  2392 2474 2392 2561 2763 ...
 $ med              : num  2314 2346 2314 2535 2754 ...
 $ jt_category_1    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ jt_category_2    : int  0 0 0 0 0 0 0 0 0 0 ...
 $ job_num_1        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ job_num_2        : int  0 0 0 0 0 0 0 0 0 0 ...

plus 64 more variables (all of type int, with binary 0/1 values).

Column "sal" is the label, and this is my train data (70% of the raw data).

I use the "caret" package in R for regression and chose the method "xgbTree", which I know works for both classification and regression.

The issue is that I want regression, but I don't know how to get it.

When I execute the full code, the error is:

Error: Metric RMSE not applicable for classification models

But I'm not trying to do classification; I want to do regression.

The type of my label (the y of the train function) is int, and I also checked the data type.

Is that wrong? Is that what makes caret recognize this training as classification?

> str(train$sal)
 int [1:30226] 2732 2732 2732 2328 2560 3584 5632 5632 3584 2150 ...

> str(train_xg)
Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
  ..@ i       : int [1:181356] 0 1 2 3 4 5 6 7 8 9 ...
  ..@ p       : int [1:71] 0 30226 60452 90504 90678 90709 90962 93875 95087 96190 ...
  ..@ Dim     : int [1:2] 30226 70
  ..@ Dimnames:List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:70] "avg" "med" "jt_category_1" "jt_category_2" ...
  ..@ x       : num [1:181356] 2392 2474 2392 2561 2763 ...
  ..@ factors : list()

Why is it misrecognized?

Do you know how to perform regression with xgboost and caret?

Thank you in advance.

The full code is here:

library(caret)
library(xgboost)

xgb_grid_1 = expand.grid(
  nrounds = 1000,
  max_depth = c(2, 4, 6, 8, 10),
  eta = c(0.5, 0.1, 0.07),
  gamma = 0.01,
  colsample_bytree = 0.5,
  min_child_weight = 1,
  subsample = 0.5
)

xgb_trcontrol_1 = trainControl(
  method = "cv",
  number = 5,
  verboseIter = TRUE,
  returnData = FALSE,
  returnResamp = "all",                                                        # save losses across all models
  classProbs = TRUE,                                                           # set to TRUE for AUC to be computed
  summaryFunction = twoClassSummary,
  allowParallel = TRUE
)

xgb_train_1 = train(
  x = as.matrix(train[ , 2:71]),
  y = as.matrix(train$sal),
  trControl = xgb_trcontrol_1,
  tuneGrid = xgb_grid_1,
  method = "xgbTree"
)

Update (18.08.10)

When I delete the two parameters (classProbs = TRUE, summaryFunction = twoClassSummary) from the trainControl call, the result is the same:

> xgb_grid_1 = expand.grid(
+   nrounds = 1000,
+   max_depth = c(2, 4, 6, 8, 10),
+   eta=c(0.5, 0.1, 0.07),
+   gamma = 0.01,
+   colsample_bytree=0.5,
+   min_child_weight=1,
+   subsample=0.5
+ )
> 
> xgb_trcontrol_1 = trainControl(
+   method = "cv",
+   number = 5,
+   allowParallel = TRUE
+ )
> 
> xgb_train_1 = train(
+   x = as.matrix(train[ , 2:71]),
+   y = as.matrix(train$sal),
+   trControl = xgb_trcontrol_1,
+   tuneGrid = xgb_grid_1,
+   method = "xgbTree"
+ )
Error: Metric RMSE not applicable for classification models

Solution

  • It's not strange that caret thinks you are asking for classification, because you are actually requesting it in these two lines of your trainControl call:

    classProbs = TRUE,     
    summaryFunction = twoClassSummary
    

    Remove both of these lines (so that they take their default values; see the function documentation), and you should be fine.

    Notice also that AUC is only applicable to classification problems.

    UPDATE (after comments): It seems that the target variable being an integer causes the problem; convert it to a double before running the model:

    train$sal <- as.double(train$sal)
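Putting both fixes together, here is a minimal end-to-end regression sketch. The data frame below is a small synthetic stand-in for the real train (its column names and values are illustrative, not the asker's data), and the tuning grid is shrunk to a single row so it runs quickly:

```r
library(caret)
library(xgboost)

# Synthetic stand-in for the real data: a numeric (double) label
# plus a few numeric/binary predictors
set.seed(42)
n <- 200
train <- data.frame(
  sal = as.double(sample(2000:6000, n, replace = TRUE)),  # double, not int
  avg = rnorm(n, 2500, 200),
  jt_category_1 = rbinom(n, 1, 0.5),
  jt_category_2 = rbinom(n, 1, 0.5)
)

# Default summary for a numeric outcome computes RMSE, R-squared, MAE:
# no classProbs, no twoClassSummary
ctrl <- trainControl(method = "cv", number = 5)

# One-row grid with all seven xgbTree tuning parameters
grid <- expand.grid(
  nrounds = 50, max_depth = 4, eta = 0.1, gamma = 0.01,
  colsample_bytree = 0.5, min_child_weight = 1, subsample = 0.5
)

fit <- train(
  x = as.matrix(train[ , -1]),  # predictors only
  y = train$sal,                # plain numeric vector -> regression
  trControl = ctrl,
  tuneGrid = grid,
  method = "xgbTree",
  metric = "RMSE"               # optional; the default for a numeric outcome
)

fit$modelType        # "Regression"
fit$results$RMSE     # cross-validated RMSE for the grid row
```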