I have just started trying the R package mlr, and I am wondering whether I can use a custom training set and test set. For example, all the observations of a time sequence except the last one form the training set, and the last observation is the test set.
Here is my example:
library(mlr)
library(survival)
library(dplyr)  # needed for %>% and select()

data(lung)
myData2 <- lung %>%
  select(time, status, age)
myData2$status <- (myData2$status == 2)  # TRUE = event occurred

# fixed split: every row except the last is training, the last row is the test set
myTrain <- 1:(nrow(myData2) - 1)
myTest <- nrow(myData2)
The lung data comes from the survival package. I only use three columns: time, status and age. Now, let's suppose they do not mean the patients' ages and how long they survive. Let's say this is the ink purchase history of one customer.
age=74 means this customer bought 74 bottles of ink on that day, and time=306 means the customer ran out of ink after 306 days. So I want to build a survival model using all the data except the last row. Then, given the data of the last row, where age=58 implies the customer bought 58 bottles of ink on that day, I can make a prediction on time. A number close to 177 would be a good estimate. So my training set and test set are fixed and do not need to be resampled.
In addition, I need to change the hyperparameters for a comparison. Here is my code:
surv.task <- makeSurvTask(data = myData2, target = c('time', 'status'))
surv.lrn <- makeLearner("surv.cforest")
ps <- makeParamSet(
  makeDiscreteParam('mincriterion', values = c(1.281552, 2, 3)),
  makeDiscreteParam('ntree', values = c(100, 200, 300))
)
ctrl <- makeTuneControlGrid()
rdesc <- makeResampleDesc('Holdout', split = 1, predict = 'train')
lrn <- makeTuneWrapper(surv.lrn, control = ctrl, resampling = rdesc, par.set = ps,
                       measures = list(setAggregation(cindex, train.mean)))
mod <- train(learner = lrn, task = surv.task, subset = myTrain)
surv.pred <- predict(mod, task = surv.task, subset = myTest)
surv.pred
You can see that I use split=1 in makeResampleDesc because I have a fixed training set which does not need to be resampled. The measures argument in makeTuneWrapper is currently not meaningful to me, as I would need to define my own measure. Because of the fixed data split, I cannot use functions like resample or tuneParams to get an evaluation on the test data for different hyperparameters.
So, my question is: when the training set and test set are fixed, can mlr provide a comprehensive comparison across all hyperparameter settings? If so, how can I do it?
Incidentally, it looks like there is a function makeFixedHoldoutInstance which might be able to do this, but I do not know how to use it. For example, when I use makeFixedHoldoutInstance in the following way, I get this error:
> f <- makeFixedHoldoutInstance(train.inds=myTrain,test.inds=myTest,size=length(myTrain)+1)
> lrn = makeTuneWrapper(surv.lrn,control=ctrl,resampling=f,par.set=ps)
> resample(learner=lrn,task=surv.task,resampling=f)
[Resample] holdout iter 1: [Tune] Started tuning learner surv.cforest for parameter set:
                 Type len Def       Constr Req Tunable Trafo
mincriterion discrete   -   - 1.281552,2,3   -    TRUE     -
ntree        discrete   -   -  100,200,300   -    TRUE     -
With control class: TuneControlGrid
Imputation value: -0
[Tune-x] 1: mincriterion=1.281552; ntree=100
Error in resample.fun(learner2, task, resampling, measures = measures, :
Size of data set: 227 and resampling instance: 228 differ!
With makeFixedHoldoutInstance you get the resampling you asked for. But you cannot use the same fixed resampling indices both for the tuning inside the tuning wrapper and for the outer resampling. This is because resample first splits the data according to the fixed holdout instance f. Then the tuning inside the tuning wrapper also needs a resampling method to calculate the performance for a given configuration. Since the tuning only sees the data remaining after the split done by resample, it cannot apply the same fixed resampling, whose indices refer to the full data set (hence the 227 vs. 228 size mismatch in your error).
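One arrangement that should work, roughly, is to keep the fixed split only for the outer resample and give the tuning wrapper an ordinary resampling description for the inner loop. This is just a sketch under that assumption; the inner Holdout below is my own choice, not something from your code:
# outer resampling: the fixed last-row split, defined on the full data set
f <- makeFixedHoldoutInstance(train.inds = myTrain, test.inds = myTest,
                              size = nrow(myData2))
# inner resampling: an ordinary holdout on whatever rows the outer split passes to the tuner
inner <- makeResampleDesc("Holdout")
lrn <- makeTuneWrapper(surv.lrn, control = ctrl, resampling = inner, par.set = ps)
res <- resample(learner = lrn, task = surv.task, resampling = f)
Keep in mind that the outer test set here is a single row, so the default cindex measure on it runs into the problem mentioned in the note at the end.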
From reading your question I guess you don't want to use the tuneWrapper but rather want to tune your learner directly. So you should simply use tuneParams:
tr = tuneParams(learner = surv.lrn, task = surv.task, resampling = cv2, par.set = ps, control = ctrl)
Note: This does not work on the given example because the cindex needs at least one uncensored observation and even then it does not make sense because the cindex is only meaningful for a bigger test set.
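If you still want every grid point evaluated on your fixed split rather than on cv2, tuneParams also accepts a resampling instance, so a sketch along these lines should work, subject to the same cindex caveat (the single test row makes the measure questionable):
f <- makeFixedHoldoutInstance(train.inds = myTrain, test.inds = myTest,
                              size = nrow(myData2))
tr <- tuneParams(learner = surv.lrn, task = surv.task, resampling = f,
                 par.set = ps, control = ctrl)
as.data.frame(tr$opt.path)  # one row per hyperparameter configuration with its performance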