I'm working on a species distribution project with presence and pseudo-absence/background points. I've set up a data frame and trained a model using caret::train with k-fold cross-validation (10 folds) and method="ranger", so I now have a ranger model fitted through caret.
Now, here's where I'm hitting a snag. I've got this stack raster file with bioclimatic stuff (WorldClim), orographic data (elevation, slope, and such), plus a couple of categorical rasters (Land use and geology type). The plan is to use terra::predict to get a raster showing presence probabilities.
But here's the catch: when I run the predict function, it either doesn't run at all or gives an error message about "Missing data in columns". I've checked, and my raster stack is fine; it worked when I was using a simpler randomForest model without k-folds.
I've tried looking for other prediction methods and played around with how I'm feeding the data, but no luck so far. Anyone got any ideas or tips to help me sort this out?
Objective: Generate a raster of presence probabilities for a species using pseudo-absences and a random forest, then use the model to predict the current and future distribution by changing only the climatic rasters.
Example data
library(terra)
library(caret)
library(tuneRanger)
library(ranger)
logo <- rast(system.file("ex/logo.tif", package="terra"))
logo[75:77, ] <- NA
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85,
66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31,
22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
xy <- rbind(cbind(1, p), cbind(0, a))
e <- extract(logo, xy[,2:3])
v <- data.frame(cbind(pa=xy[,1], e))
Make the model
v_NA_kNN <- caret::preProcess(v, method="bagImpute")
v_rf <- predict(v_NA_kNN,v)
v_rf$pa <- as.factor(v_rf$pa)
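# Caution: the levels sort as "0","1", so this labels 0 (background) as "Pres";
# use c("Abs","Pres") if 1 is meant to be presence
levels(v_rf$pa) <- c("Pres","Abs")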
rf.task <- makeClassifTask(data = v_rf, target = "pa")
res <- tuneRanger(rf.task, measure = list(multiclass.brier), num.trees = 1e+02,
num.threads = 4, iters = 20, save.file.path = NULL)
fitControl <- caret::trainControl(
method = "repeatedcv",
number = 5,
repeats = 5,
allowParallel = TRUE,
classProbs = TRUE,
returnData = T,
savePredictions = "final"
)
ranger_model <- caret::train(
v_rf[,-1],
as.factor(v_rf$pa), #This way factor is not separated by levels
method = "ranger",
trControl = fitControl,
tuneGrid = expand.grid(mtry = res$recommended.pars[,1],
min.node.size = res$recommended.pars[,2],
splitrule = "gini"),
num.trees = 1e+02,
num.threads = 4,
importance = 'impurity'
)
Predict
predfun <- function(...) predict(...)$predictions
x <- terra::predict(logo, ranger_model, fun=predfun)
# Error in predict(...)$predictions :
# $ operator is invalid for atomic vectors
# Called from: fun(model, d, ...)
If you run terra::predict with default arguments you get:
x <- terra::predict(logo, ranger_model)
#Error: Missing data in columns: red, green, blue.
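The raster has cells with NA (rows 75 to 77 were set to NA in the example data), and ranger refuses to predict for rows with missing values. You can fix that by using na.rm=TRUE, which removes those cells from the computation and leaves them NA in the output: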
x <- terra::predict(logo, ranger_model, na.rm=TRUE)
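To see how many cells are affected before predicting, you can count the incomplete cells yourself. This is only a minimal sketch on the same example raster; the names vals and n_incomplete are made up for illustration.

```r
library(terra)

# Same example raster as in the question, with the last three rows set to NA
logo <- rast(system.file("ex/logo.tif", package = "terra"))
logo[75:77, ] <- NA

# One row per cell, one column per layer (red, green, blue)
vals <- values(logo)

# Cells with an NA in at least one layer; these are the ones that trigger the error
n_incomplete <- sum(!complete.cases(vals))
n_incomplete

# Share of the raster that will be NA in the prediction when na.rm=TRUE is used
n_incomplete / ncell(logo)
```

With real predictor stacks the same check shows whether the NAs are just the expected mask (sea, outside the study area) or whether one layer has gaps the others do not.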
You do not need to supply a specialized predict function, because the caret predict function returns a simple vector (a factor with the predicted classes):
predict(ranger_model, logo[1:4])
#[1] Pres Pres Pres Pres
#Levels: Pres Abs
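Since the stated goal is a raster of probabilities rather than classes, it may help to pass type="prob" through terra::predict to caret's predict method. The following is a sketch under assumptions, not the model from the question: the training data d and its labelling rule are invented, and the model m is a small stand-in fitted with caret and method="ranger".

```r
library(terra)
library(caret)
library(ranger)

logo <- rast(system.file("ex/logo.tif", package = "terra"))

set.seed(1)
# Hypothetical training set: random cells, labelled by an arbitrary rule
d <- spatSample(logo, 100, method = "random", na.rm = TRUE)
d$pa <- factor(ifelse(d$red > 200, "Pres", "Abs"), levels = c("Abs", "Pres"))

# classProbs = TRUE is needed so that caret can return class probabilities
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE)
m <- train(pa ~ red + green + blue, data = d, method = "ranger",
           trControl = ctrl, num.trees = 100,
           tuneGrid = data.frame(mtry = 2, splitrule = "gini", min.node.size = 1))

# type = "prob" is forwarded to caret's predict; one layer per class is returned
prob <- terra::predict(logo, m, type = "prob", na.rm = TRUE)
names(prob)

# Keep only the presence probability
pres <- prob[["Pres"]]
```

The layer names follow the factor levels, so selecting the layer by name is safer than relying on its position.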
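In contrast, if you used the predict function from "ranger", you would get a list. That applies to a model fitted directly with ranger::ranger, or to the underlying ranger fit that caret stores in ranger_model$finalModel. In that case you could use
predfun <- function(...) predict(...)$predictions
# pass the ranger object itself, not the caret wrapper
x <- terra::predict(logo, ranger_model$finalModel, fun=predfun, na.rm=TRUE)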
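For comparison, here is a minimal sketch of the same idea with a model fitted directly by ranger, without caret. The training data d, its labelling rule, and the object name rf are assumptions made for the example; probability = TRUE asks ranger for a probability forest, so $predictions is a matrix with one column per class.

```r
library(terra)
library(ranger)

logo <- rast(system.file("ex/logo.tif", package = "terra"))
logo[75:77, ] <- NA

set.seed(1)
# Hypothetical training set: random non-NA cells, labelled by an arbitrary rule
d <- spatSample(logo, 100, method = "random", na.rm = TRUE)
d$pa <- factor(ifelse(d$red > 200, "Pres", "Abs"), levels = c("Abs", "Pres"))

# Probability forest fitted directly with ranger
rf <- ranger(pa ~ ., data = d, num.trees = 100, probability = TRUE)

# predict() on a ranger object returns a list; terra needs the $predictions element
predfun <- function(...) predict(...)$predictions
prob <- terra::predict(logo, rf, fun = predfun, na.rm = TRUE)

names(prob)
```

Rows 75 to 77 stay NA in the output because na.rm=TRUE skips them instead of handing them to ranger.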