Search code examples
random-forestsurvival-analysis

Have RMSE in Random Survival Forest in R program


I should have RMSE in three model to compare them with each other to say which one is better than the others. My models which I should run are Survival decision tree , Random survival forest and Bagging. I have been running my models but in the end I only have some predict. I brought Random survival forest result in the following. What should I do to have RMSE?

library(survival)

library(randomForestSRC)

dataset<-data.frame(data)

dataset

n.sample=round(0.5*nrow(dataset))

dataset1=sample (1: nrow(dataset),n.sample)

train=data[dataset1,]

test= data[-dataset1 ,]

set.seed(1369)

rsf0=rfsrc(Surv(time,status)~.,train,importance=TRUE,forest=T,ensemble="oob",mtry=NULL,block.size=1,splitrule="logrank")

print(rsf0)

Results:
Sample size: 821
Number of deaths: 209
Number of trees: 1000
Forest terminal node size: 15
Average no. of terminal nodes: 38.62
No. of variables tried at each split: 4
Total no. of variables: 14
Resampling used to grow trees: swor
Resample size used to grow trees: 519
Analysis: RSF
Family: surv
Splitting rule: logrank random
Number of random split points: 10
Error rate: 36.15%


Solution

  • I think you slightly misunderstand what survival analysis models are usually used for. Normally we want to predict the distribution of the survival time and not the survival time itself. The RMSE can only be used when the actual survival time is predicted. In your example, the models you discuss make a distribution prediction.

    So firstly I've cleaned up your code slightly and added an example dataset to make it reproducible:

    library(survival)
    library(randomForestSRC)
    
    # use the rats dataset to make the example reproducible
    dataset <- data.frame(survival::rats)
    dataset$sex <- factor(dataset$sex)
    
    # note that you need to set.seed before you use `sample`
    set.seed(1369)
    
    # again specifying train/test split but this time as two separate sets of integers
    train = sample(nrow(dataset), 0.5 * nrow(dataset))
    test = setdiff(seq(nrow(dataset)), train)
    
    # train the random forest model on the training data
    rsf0 = rfsrc(Surv(time,status)~., dataset[train, ], importance=TRUE, forest=T, 
    ensemble="oob", mtry=NULL, block.size=1, splitrule="logrank")
    
    # now make predictions
    predictions = predict(rsf0, newdata = dataset[-train, ])
    
    # view the predicted survival probabilities
    predictions$survival
    

    With these probabilities, you have to make a decision about how to convert them to survival time predictions, and then you have to manually compute the RMSE after first removing all censored observations. Common conversions to survival time are to take the mean of the predicted individual distributions or the median.

    As an alternative, and plugging my own package here, you could use {mlr3proba} which does this for you:

    # load required packages
    library(mlr3); library(mlr3proba);library(mlr3extralearners); library(mlr3pipelines)
    
    # use the rats dataset to make the example reproducible
    dataset <- data.frame(survival::rats)
    dataset$sex <- factor(dataset$sex)
    
    # note that you need to set.seed before you use `sample`
    set.seed(1369)
    
    # again specifying train/test split but this time as two separate sets of integers
    train = sample(nrow(dataset), 0.5 * nrow(dataset))
    test = setdiff(seq(nrow(dataset)), train)
    
    # select the random forest model and use the `crankcompositor` to automatically
    # create survival time predictions
    learn = ppl("crankcompositor", lrn("surv.rfsrc"), response = TRUE, graph_learner = TRUE)
    
    # create a task which stores your dataset
    task = TaskSurv$new("data", backend = dataset, time = "time", event = "status")
    
    # train your learner on training data
    learn$train(task, row_ids = train)
    
    # make predictions on test data
    predictions = learn$predict(task, row_ids = test)
    
    # view your survival time predictions
    predictions$response
    
    # calculate RMSE
    predictions$score(msr("surv.rmse"))
    

    This second option is more complicated if you're not used to R6, but I suspect that in your use-case it will benefit you as you can also compare multiple models at the same time with this.