
How to resample and compare the results when I only want to predict the last row of the data using the surv. functions in the mlr package in R?


I have just started trying the R package mlr, and I am wondering whether I can customize the training set and the test set. For example, all observations of a time sequence except the last form the training set, and the last one is the test set.

Here is my example:

library(mlr)
library(survival)
library(dplyr)  # provides %>% and select()
data(lung)
myData2 <- lung %>%
    select(time,status,age)
myData2$status = (myData2$status == 2)
myTrain <- c(1:(nrow(myData2)-1))
myTest <- nrow(myData2)

The lung data come from the survival package. I only use three columns: time, status and age. Now, let's suppose they do not describe the patients' ages and survival times. Let's say this is the ink purchase history of one customer.

age=74 means the customer bought 74 bottles of ink on that day, and time=306 means the customer ran out of ink after 306 days. So I want to build a survival model using all the data except the last row. Then, when I get the last row, where age=58 means the customer bought 58 bottles of ink on that day, I can predict time. A number close to 177 would be a good estimate. My training set and test set are therefore fixed and do not need to be resampled.

In addition, I need to vary the hyperparameters and compare the results. Here is my code:

surv.task <- makeSurvTask(data=myData2,target=c('time','status'))
surv.lrn <- makeLearner("surv.cforest")
ps <- makeParamSet(
  makeDiscreteParam('mincriterion',values=c(1.281552,2,3)),
  makeDiscreteParam('ntree',values=c(100,200,300))
)
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc('Holdout',split=1,predict='train') 
lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=rdesc,par.set=ps,
                      measures = list(setAggregation(cindex,train.mean)))
mod <- train(learner=lrn,task=surv.task,subset=myTrain)
surv.pred <- predict(mod,task=surv.task,subset=myTest)
surv.pred

You can see that I use split=1 in makeResampleDesc because my training set is fixed and does not need to be resampled. The measures argument of makeTuneWrapper is currently not meaningful to me, as I need to define my own measures. Because the data split is fixed, I cannot use functions like resample or tuneParams to evaluate different hyperparameters on the test data.

So my question is: when the training set and test set are fixed, can mlr provide a comprehensive comparison across all hyperparameter settings? If so, how?

Incidentally, it looks like there is a function makeFixedHoldoutInstance that might do this, but I do not know how to use it. For example, when I use makeFixedHoldoutInstance in the following way, I get this error:

> f <- makeFixedHoldoutInstance(train.inds=myTrain,test.inds=myTest,size=length(myTrain)+1)
> lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=f,par.set=ps)
> resample(learner=lrn,task=surv.task,resampling=f)
[Resample] holdout iter 1: [Tune] Started tuning learner surv.cforest for parameter set:
                 Type len Def       Constr Req Tunable Trafo
mincriterion discrete   -   - 1.281552,2,3   -    TRUE     -
ntree        discrete   -   -  100,200,300   -    TRUE     -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mincriterion=1.281552; ntree=100
Error in resample.fun(learner2, task, resampling, measures = measures,  : 
  Size of data set: 227 and resampling instance: 228 differ!

Solution

  • With makeFixedHoldoutInstance you get the resampling you asked for. However, you cannot use the same fixed resampling indices both for the tuning inside the tuning wrapper and for the outer resampling.

    This is because resample first splits the data according to the fixed holdout instance f. The tuning inside the tuning wrapper then needs its own resampling to calculate the performance of a given configuration. Because the tuning only sees the training part of the split made by resample (227 rows), it cannot apply the same fixed instance, which was built for 228 rows.

    From reading your question, I guess you do not want to use the tune wrapper but want to tune your learner directly. In that case, simply use tuneParams:

    tr = tuneParams(learner = surv.lrn, task = surv.task, resampling = cv2, par.set = ps, control = ctrl)
    

    Note: This does not work on the given example, because the cindex needs at least one uncensored observation in the test set. Even then it would not make sense, because the cindex is only meaningful for a larger test set.
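
To illustrate tuning directly on a fixed split, here is a minimal sketch that passes a fixed holdout instance to tuneParams and then lists the performance of every grid point from the optimization path. It assumes a test set made of the last 50 rows rather than the single last row, purely so that the cindex can be computed. That choice and the names train.inds, test.inds and res are illustrative and not part of the original question. It also assumes the party and Hmisc packages needed by surv.cforest and cindex are installed.

```r
library(mlr)
library(survival)

set.seed(1)

# Same three columns as in the question; status is TRUE for an observed event.
myData2 <- survival::lung[, c("time", "status", "age")]
myData2$status <- myData2$status == 2

n <- nrow(myData2)
# Assumption: hold out the last 50 rows (not just 1) so that cindex is defined.
test.inds <- (n - 49):n
train.inds <- setdiff(seq_len(n), test.inds)

surv.task <- makeSurvTask(data = myData2, target = c("time", "status"))
surv.lrn <- makeLearner("surv.cforest")
ps <- makeParamSet(
  makeDiscreteParam("mincriterion", values = c(1.281552, 2, 3)),
  makeDiscreteParam("ntree", values = c(100, 200, 300))
)
ctrl <- makeTuneControlGrid()

# size must equal the number of rows of the task the instance is applied to.
f <- makeFixedHoldoutInstance(train.inds = train.inds,
                              test.inds = test.inds,
                              size = n)

# Every grid point is trained on train.inds and evaluated on test.inds.
tr <- tuneParams(learner = surv.lrn, task = surv.task, resampling = f,
                 measures = cindex, par.set = ps, control = ctrl)

# One row per hyperparameter combination, with its test performance.
res <- as.data.frame(tr$opt.path)
print(res[, c("mincriterion", "ntree", "cindex.test.mean")])
```

The best setting found is in tr$x and its performance in tr$y. A custom measure could be passed through the measures argument in place of cindex.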