Search code examples
rgeospatialpredictterra

predict the probabilities for only a single class (binary classification)


I am producing species distribution models for several hundred species with a vrt stack of 45 features totaling ~110GiB. These models predict the probability of suitable habitat on a scale of 0-1, wherein < 0.5 is unsuitable and > 0.5 is suitable. For each species' randomForest model, only a subset of features (gen. < 20) are used. Currently, predicting the predicted probabilities onto .tifs takes 4-9 hours depending on the domain and number of terms.

I have been writing the predictions directly to disk ala:

terra::predict(
   x, y, cpkgs="randomForest", type = 'prob', ncores = parallel::detectCores(), filename = 'example.tif', ...))

So I am hoping that it is possible to predict the probabilities for only a single class.

An mre example can be found in terras ?predict, the plots are added at end to illustrate the desired result.

library(terra)
library(randomForest)

logo <- rast(system.file("ex/logo.tif", package="terra"))   
names(logo) <- c("red", "green", "blue")
p <- matrix(c(48, 48, 48, 53, 50, 46, 54, 70, 84, 85, 74, 84, 95, 85, 
              66, 42, 26, 4, 19, 17, 7, 14, 26, 29, 39, 45, 51, 56, 46, 38, 31, 
              22, 34, 60, 70, 73, 63, 46, 43, 28), ncol=2)

a <- matrix(c(22, 33, 64, 85, 92, 94, 59, 27, 30, 64, 60, 33, 31, 9,
              99, 67, 15, 5, 4, 30, 8, 37, 42, 27, 19, 69, 60, 73, 3, 5, 21,
              37, 52, 70, 74, 9, 13, 4, 17, 47), ncol=2)

xy <- rbind(cbind(1, p), cbind(0, a))

# extract predictor values for points
e <- extract(logo, xy[,2:3])

# combine with response (excluding the ID column)
v <- data.frame(cbind(pa=xy[,1], e))

rm(p, a, xy, e)

### with two output variables (probabilities for each class)
v$pa <- as.factor(v$pa)
rfm2 <- randomForest(formula=pa~., data=v)
rfp <- predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest")

plot(rfp) # showing probabilities for both classes
plot(sum(rfp)) # obviously in binary classification the probabilities sum to 1
plot(subset(rfp, 2)) # interested in only one of the probabilities!!

As can be seen, the default behavior of predict is slightly redundant - not that it matters for most applications.

I believe I can speed up the process by only predicting in memory, and then writing out a single raster afterwards.

rfp <- terra::subset( 
         terra::predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest"),
 2)
# terra::writeRaster(rfp)

However, I am wondering if prediction in memory, or (hopefully) straight to disk, can be expedited by only returning the probabilities for a single class?

I do not know enough about predict to produce a tentative solution aside from subsetting the output. I reason that the actual calculation of both classes probabilities is very very fast, but suspect that it makes a notable difference given the size of the data.

(P.S. on a gis not I suspect that the more serious rate limiting step is likely pulling the values of the VRT's into memory, they are being transferred from hdd to sdd now).


Solution

  • You can create a wrapper around the randomForest predict function like this:

    f <- \(...) predict(...)[,1]
    rfp <- predict(logo, rfm2, cores=2, type="prob", cpkgs="randomForest", fun=f)