r, proximity, random-forest

R RandomForest: Proximity for new object


I trained a random forest:

model <- randomForest(x, y, proximity=TRUE)

When I want to predict y for new objects, I use

y_pred <- predict(model, xnew)

How can I calculate the proximity between the new objects (xnew) and the training set (x) using the already existing forest (model)? The proximity option of the predict function gives only the proximities among the new objects (xnew). I could run randomForest unsupervised again on the combined data set (x and xnew) to get the proximities, but I think there must be some way to avoid rebuilding the forest and use the existing one instead.

Thanks! Kilian


Solution

  • I believe what you want is to specify your test observations in the randomForest call itself, something like this:

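    library(randomForest)
    
    set.seed(71)
    # hold out 10 of the 150 iris rows as "new" cases
    ind <- sample(1:150, 140, replace = FALSE)
    train <- iris[ind, ]
    test <- iris[-ind, ]
    
    iris.rf1 <- randomForest(x = train[, 1:4],
                             y = train[, 5],
                             xtest = test[, 1:4],
                             ytest = test[, 5],
                             importance = TRUE,
                             proximity = TRUE)
    
    # proximities for the test cases are stored under $test
    dim(iris.rf1$test$proximity)
    # [1]  10 150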
    

    So that gives you the proximity from the ten test cases to all 150 cases, that is, the 10 test cases themselves plus the 140 training cases.

    The only other option, I think, would be to call predict on your new cases rbind-ed to the original training cases. That way you don't need to have your new cases available up front when you call randomForest.

    In that case, you'll want keep.forest = TRUE in the randomForest call (it is the default unless xtest is supplied), and of course proximity = TRUE when you call predict.
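
A minimal sketch of that second option, assuming the same iris split as above. The names iris.rf2, combined, and prox_new_train are made up for illustration. The new cases are stacked on top of the training cases, so the block of the proximity matrix with new cases in the rows and training cases in the columns is what the question asks for:

```r
library(randomForest)

set.seed(71)
ind   <- sample(1:150, 140, replace = FALSE)
train <- iris[ind, ]
test  <- iris[-ind, ]   # stands in for xnew

# Fit on the training rows only; keep.forest = TRUE so predict() can reuse the trees
iris.rf2 <- randomForest(x = train[, 1:4], y = train[, 5], keep.forest = TRUE)

# Stack the new cases on top of the training cases and run them all down the forest
combined <- rbind(test[, 1:4], train[, 1:4])
out <- predict(iris.rf2, combined, proximity = TRUE)

# out$proximity covers all stacked rows; keep new cases (rows) vs. training cases (columns)
n_new <- nrow(test)
prox_new_train <- out$proximity[seq_len(n_new), -seq_len(n_new)]
dim(prox_new_train)
```

The forest is not rebuilt here; predict only drops the stacked rows down the existing trees, at the cost of also computing the training-to-training block, which is simply discarded.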