predict the probabilities for only a single class (binary classification)

I am producing species distribution models for several hundred species with a vrt stack of 45 features totaling ~110GiB. These models predict the probability of suitable habitat on a scale of 0-1, wherein < 0.5 is unsuitable and > 0.5 is suitable. For each species' randomForest model, only a subset of features (gen. < 20) are used. Currently, predicting the predicted probabilities onto .tifs takes 4-9 hours depending on the domain and number of terms.

I have been writing the predictions directly to disk ala:

terra::predict(
   x, y, cpkgs="randomForest", type = 'prob', ncores = parallel::detectCores(), filename = 'example.tif', ...))

So I am hoping that it is possible to predict the probabilities for only a single class.

An mre example can be found in terras ?predict, the plots are added at end to illustrate the desired result.

library(terra)
library(randomForest)

logo <- rast(system.file("ex/logo.tif", package="terra"))   
names(logo) <- c("red", "green", "blue")
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85, 
              66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31, 
              22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)

a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
              99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
              37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)

xy <- rbind(cbind(1, p), cbind(0, a))

# extract predictor values for points
e <- extract(logo, xy[,2:3])

# combine with response (excluding the ID column)
v <- data.frame(cbind(pa=xy[,1], e))

rm(p, a, xy, e)

### with two output variables (probabilities for each class)
v$pa <- as.factor(v$pa)
rfm2 <- randomForest(formula=pa~., data=v)
rfp <- predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest")

plot(rfp) # showing probabilities for both classes
plot(sum(rfp)) # obviously in binary classification the probabilities sum to 1
plot(subset(rfp, 2)) # interested in only one of the probabilities!!

As can be seen, the default behavior of predict is slightly redundant - not that it matters for most applications.

I believe I can speed up the process by only predicting in memory, and then writing out a single raster afterwards.

rfp <- terra::subset( 
         terra::predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest"),
 2)
# terra::writeRaster(rfp)

However, I am wondering if prediction in memory, or (hopefully) straight to disk, can be expedited by only returning the probabilities for a single class?

I do not know enough about predict to produce a tentative solution aside from subsetting the output. I reason that the actual calculation of both classes probabilities is very very fast, but suspect that it makes a notable difference given the size of the data.

(P.S. on a gis not I suspect that the more serious rate limiting step is likely pulling the values of the VRT's into memory, they are being transferred from hdd to sdd now).

Solution

You can create a wrapper around the randomForest predict function like this:

f <- \(...) predict(...)[,1]
rfp <- predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest", fun=f)