Tags: r, list, types, typeerror, random-forest

Remove original data from regression_forest without changing the type to "list"


I would like to fit a regression_forest with the grf package and, for data protection reasons, remove the original data that is stored in the fitted regression_forest object.

The problem is that when I remove the data, R doesn't recognize the object as a regression_forest anymore and therefore throws an error.

Does anyone know how to work around this problem?

Here is a reproducible example:

library(grf)

# Train a standard regression forest.
n <- 50
p <- 10
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] * rnorm(n)
r.forest <- regression_forest(X, Y)

# Remove the original data
r.forest <- r.forest[-c(18,19)]

# Predict using the forest.
X.test <- matrix(0, 101, p)
X.test[, 1] <- seq(-2, 2, length.out = 101)
r.pred <- predict(r.forest, X.test)

The last line causes the following error:

Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "list"


Solution

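  • Subsetting with [ returns a plain list and drops the class attribute, which is why predict fails after r.forest[-c(18,19)]. The predict function also seems to need to know the dimensions of the original data, but from what I can tell it doesn't need the data itself.

    If you instead overwrite the original data stored in the model object with NA, keeping the element and its dimensions in place, the predictions seem unaffected.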

    # Get original predictions (from the untouched forest returned by regression_forest)
    r.pred.original <- predict(r.forest, X.test)
    
    # Convert stored data to NA
    r.forest$X.orig[!is.na(r.forest$X.orig)] <- NA
    
    # Get new predictions
    r.pred.new <- predict(r.forest, X.test)
    
    # r.pred.original and r.pred.new are the same; all.equal() should return TRUE
    all.equal(r.pred.original, r.pred.new)
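
The class behaviour behind the error can be reproduced without grf. The following is a minimal sketch in base R. The toy_forest class, its fields, and its predict method are hypothetical stand-ins, not part of grf. It shows that dropping elements with [ yields a bare list, that overwriting values in place keeps both the class and the dimensions, and that the class can be copied back by hand.

```r
# Toy stand-in for a fitted model: a list carrying an S3 class attribute.
# "toy_forest" and its fields are hypothetical, not the grf internals.
toy <- structure(
  list(coef = 2, X.orig = matrix(1:6, nrow = 3, ncol = 2)),
  class = "toy_forest"
)

# A predict method, so that S3 dispatch can be observed.
predict.toy_forest <- function(object, newdata, ...) {
  object$coef * newdata
}

# 1. Dropping an element with `[` returns a plain list: the class is gone,
#    so predict() has no method to dispatch to.
stripped <- toy[-2]
class(stripped)  # "list"
err <- tryCatch(predict(stripped, 3), error = function(e) e)

# 2. Overwriting the stored values in place keeps the class and the
#    dimensions of the stored matrix.
masked <- toy
masked$X.orig[] <- NA
class(masked)       # "toy_forest"
dim(masked$X.orig)  # 3 2
predict(masked, 3)

# 3. After dropping elements, the class attribute can be re-attached by hand.
restored <- toy[-2]
class(restored) <- class(toy)
predict(restored, 3)
```

Whether the third approach is safe for a real regression_forest is an open question: re-attaching the class only restores method dispatch, and if predict still reads the dimensions of the removed elements, as the answer above suggests, it may fail in a different way. Overwriting with NA avoids that risk because the element stays in place.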