I am using the randomForest package in R, but am not partial to solutions using other packages. my RF model is using various continuous and categorical variables to predict extinction risk (Threatened, Non_Threatened). I would like to be able to show the direction of variable importance for predictors used in my RF model. Other publications have done exactly this: Figure 1 in https://www.pnas.org/content/pnas/109/9/3395.full.pdf
Any ideas on how to do something similar? One suggestion I read said to simply compare the difference between two partial dependence plots (example below), but I feel this may not be the best way. Any help would be greatly appreciated.
partialPlot(final_rf, rf_train, size_mat,"Threatened")
partialPlot(final_rf, rf_train, size_mat,"Non_Threatened")
response = Threatened
response = Non_Threatened
You could use something like an average marginal effect (or like below, an average first difference) approach.
First, I'll make some data
set.seed(11)
n = 200
p = 5
X = data.frame(matrix(runif(n * p), ncol = p))
yhat = 10 * sin(pi* X[ ,1] * X[,2]) +20 *
(X[,3] -.5)^2 + 10 * -X[ ,4] + 5 * -X[,5]
y = as.numeric((yhat+ rnorm(n)) > mean(yhat))
df <- as.data.frame(cbind(X,y))
Next, we'll estimate the RF model:
library(randomForest)
rf <- randomForest(as.factor(y) ~ ., data=df)
Net, we can loop through each variable, in each time through the loop, we're adding one standard deviation to a single x
variable for all observations. In your approach, you could also change from one category to another for categorical variables. Then, we predict the probability of a positive response under both conditions - the original condition and the one with a standard deviation added to each variable. Then we could take the difference and summarize.
nx <- names(df)
nx <- nx[-which(nx == "y")]
res <- NULL
for(i in 1:length(nx)){
p1 <- predict(rf, newdata=df, type="prob")
df2 <- df
df2[[nx[i]]] <- df2[[nx[i]]] + sd(df2[[nx[i]]])
p2 <- predict(rf, newdata=df2, type="prob")
diff <- (p2-p1)[,2]
res <- rbind(res, c(mean(diff), sd(diff)))
}
colnames(res) <- c("effect", "sd")
rownames(res) <- nx
res
# effect sd
# X1 0.11079 0.18491252
# X2 0.10265 0.16552070
# X3 0.02015 0.07951409
# X4 -0.11687 0.16671916
# X5 -0.04704 0.10274836