I need the RMSE for three models so that I can compare them and say which one performs best. The models I have to run are a survival decision tree, a random survival forest, and bagging. I have run the models, but all I get at the end are predictions. The random survival forest output is below. What do I need to do to get an RMSE?
library(survival)
library(randomForestSRC)
dataset<-data.frame(data)
dataset
n.sample = round(0.5 * nrow(dataset))
dataset1 = sample(1:nrow(dataset), n.sample)
train = data[dataset1, ]
test = data[-dataset1, ]
set.seed(1369)
rsf0=rfsrc(Surv(time,status)~.,train,importance=TRUE,forest=T,ensemble="oob",mtry=NULL,block.size=1,splitrule="logrank")
print(rsf0)
Results:
Sample size: 821
Number of deaths: 209
Number of trees: 1000
Forest terminal node size: 15
Average no. of terminal nodes: 38.62
No. of variables tried at each split: 4
Total no. of variables: 14
Resampling used to grow trees: swor
Resample size used to grow trees: 519
Analysis: RSF
Family: surv
Splitting rule: logrank random
Number of random split points: 10
Error rate: 36.15%
I think you slightly misunderstand what survival analysis models are usually used for. Normally we want to predict the distribution of the survival time, not the survival time itself, and RMSE can only be computed when an actual survival time is predicted. All three models you mention make distribution predictions.
So firstly I've cleaned up your code slightly and added an example dataset to make it reproducible:
library(survival)
library(randomForestSRC)
# use the rats dataset to make the example reproducible
dataset <- data.frame(survival::rats)
dataset$sex <- factor(dataset$sex)
# note that you need to set.seed before you use `sample`
set.seed(1369)
# again specifying train/test split but this time as two separate sets of integers
train = sample(nrow(dataset), 0.5 * nrow(dataset))
test = setdiff(seq(nrow(dataset)), train)
# train the random forest model on the training data
rsf0 = rfsrc(Surv(time,status)~., dataset[train, ], importance=TRUE, forest=T,
ensemble="oob", mtry=NULL, block.size=1, splitrule="logrank")
# now make predictions
predictions = predict(rsf0, newdata = dataset[test, ])
# view the predicted survival probabilities
predictions$survival
These are survival probabilities, so you have to decide how to convert them into survival time predictions, and then compute the RMSE yourself after first removing all censored observations. Common conversions are to take the mean or the median of each predicted individual distribution; a minimal sketch of the mean-based version follows.
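For example, one rough way to do the mean-based conversion (just a sketch, not the only option: it assumes the `predictions`, `dataset` and `test` objects from above, and that your version of randomForestSRC exposes `time.interest` on the prediction object):
# sketch: approximate each subject's expected survival time as the area
# under its predicted survival curve
surv_prob = predictions$survival        # matrix: test subjects x event times
times = predictions$time.interest       # event times the curves are evaluated at
widths = diff(c(0, times))              # widths of the time intervals
pred_time = as.vector(surv_prob %*% widths)   # approximate mean survival time per subject
# RMSE on the uncensored test observations only
test_data = dataset[test, ]
uncensored = test_data$status == 1
sqrt(mean((test_data$time[uncensored] - pred_time[uncensored])^2))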
As an alternative, and plugging my own package here, you could use {mlr3proba} which does this for you:
# load required packages
library(mlr3); library(mlr3proba); library(mlr3extralearners); library(mlr3pipelines)
# use the rats dataset to make the example reproducible
dataset <- data.frame(survival::rats)
dataset$sex <- factor(dataset$sex)
# note that you need to set.seed before you use `sample`
set.seed(1369)
# again specifying train/test split but this time as two separate sets of integers
train = sample(nrow(dataset), 0.5 * nrow(dataset))
test = setdiff(seq(nrow(dataset)), train)
# select the random forest model and use the `crankcompositor` to automatically
# create survival time predictions
learn = ppl("crankcompositor", lrn("surv.rfsrc"), response = TRUE, graph_learner = TRUE)
# create a task which stores your dataset
task = TaskSurv$new("data", backend = dataset, time = "time", event = "status")
# train your learner on training data
learn$train(task, row_ids = train)
# make predictions on test data
predictions = learn$predict(task, row_ids = test)
# view your survival time predictions
predictions$response
# calculate RMSE
predictions$score(msr("surv.rmse"))
This second option is more complicated if you're not used to R6, but I suspect it will benefit your use case because it also lets you compare several models at the same time; a rough sketch of that is below.
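For example, something along these lines (again only a sketch: it reuses the `task` from above, and assumes the learner keys, e.g. `surv.rpart` for a survival decision tree, are available in your installed version of {mlr3extralearners}):
# wrap each model in the crankcompositor so it produces survival time predictions
learners = lapply(
  list(lrn("surv.rpart"), lrn("surv.rfsrc")),   # e.g. survival tree and random survival forest
  function(l) ppl("crankcompositor", l, response = TRUE, graph_learner = TRUE)
)
# benchmark all models on the same task and score them all with RMSE
design = benchmark_grid(task, learners, rsmp("holdout"))
bm = benchmark(design)
bm$aggregate(msr("surv.rmse"))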