I am producing species distribution models for several hundred species with a vrt stack of 45 features totaling ~110GiB. These models predict the probability of suitable habitat on a scale of 0-1, wherein < 0.5 is unsuitable and > 0.5 is suitable. For each species' randomForest
model, only a subset of features (gen. < 20) are used. Currently, predicting the predicted probabilities onto .tifs takes 4-9 hours depending on the domain and number of terms.
I have been writing the predictions directly to disk ala:
terra::predict(
x, y, cpkgs="randomForest", type = 'prob', ncores = parallel::detectCores(), filename = 'example.tif', ...))
So I am hoping that it is possible to predict the probabilities for only a single class.
An mre example can be found in terras ?predict, the plots are added at end to illustrate the desired result.
library(terra)
library(randomForest)
logo <- rast(system.file("ex/logo.tif", package="terra"))
names(logo) <- c("red", "green", "blue")
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85,
66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31,
22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)
a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)
xy <- rbind(cbind(1, p), cbind(0, a))
# extract predictor values for points
e <- extract(logo, xy[,2:3])
# combine with response (excluding the ID column)
v <- data.frame(cbind(pa=xy[,1], e))
rm(p, a, xy, e)
### with two output variables (probabilities for each class)
v$pa <- as.factor(v$pa)
rfm2 <- randomForest(formula=pa~., data=v)
rfp <- predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest")
plot(rfp) # showing probabilities for both classes
plot(sum(rfp)) # obviously in binary classification the probabilities sum to 1
plot(subset(rfp, 2)) # interested in only one of the probabilities!!
As can be seen, the default behavior of predict
is slightly redundant - not that it matters for most applications.
I believe I can speed up the process by only predicting in memory, and then writing out a single raster afterwards.
rfp <- terra::subset(
terra::predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest"),
2)
# terra::writeRaster(rfp)
However, I am wondering if prediction in memory, or (hopefully) straight to disk, can be expedited by only returning the probabilities for a single class?
I do not know enough about predict
to produce a tentative solution aside from subsetting the output. I reason that the actual calculation of both classes probabilities is very very fast, but suspect that it makes a notable difference given the size of the data.
(P.S. on a gis not I suspect that the more serious rate limiting step is likely pulling the values of the VRT's into memory, they are being transferred from hdd to sdd now).
You can create a wrapper around the randomForest predict
function like this:
f <- \(...) predict(...)[,1]
rfp <- predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest", fun=f)