I want to use the terra package for spatial prediction, applying a model that uses a lot of memory internally. The chunk sizes that terra uses are too large for this model to run in, causing the R session to crash from running out of memory. In my specific use case the real function creates a large numpy array for a TensorFlow deep learning model, but this issue will apply to any situation where the model function's memory requirement scales strongly with chunk size.
Normally, we can indirectly affect the chunk size by setting memfrac in terraOptions(). However, there seems to be a minimum threshold of memfrac below which this ceases to work.
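For example, something like the following would normally be enough to make terra use more, smaller chunks (0.1 is just an illustrative value):

library(terra)
# allow terra to use at most ~10% of available RAM; a smaller memfrac
# normally means the data is processed in more, smaller chunks
terraOptions(memfrac = 0.1)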
Here's a simple reproducible example:
First, let's define a dummy model function that we want to apply, which has a high memory requirement:
f = function(x, y, z) {
  # build a large 3D array: memory use grows rapidly with the chunk size
  a = array(rnorm(length(x)), c(length(x), length(y), length(z)))
  return(rep(sum(a), length(x)))
}
And some dummy raster data to apply the model to:
library(terra)
x = rast(matrix(runif(1e7), 1e3, 1e4))
y = rast(matrix(runif(1e7), 1e3, 1e4))
z = rast(matrix(runif(1e7), 1e3, 1e4))
in_data = rast(list(x,y,z))
rm(x,y,z); gc()
We can see that with the default options, terra will try to process this data in a single chunk:
mem_info(in_data)
# ------------------------
# Memory (GB)
# ------------------------
# check threshold : 1 (memmin)
# available : 107.96
# allowed (60%) : 64.78
# needed (n=1) : 0.22
# ------------------------
# proc in memory : TRUE
# nr chunks : 1
# ------------------------
However small we make memfrac, terra still wants to work in a single chunk. There seems to be a minimum value of memfrac below which it no longer forces smaller chunks. E.g., in the output below we can see that one chunk is used even though the needed memory is larger than the allowed memory:
terraOptions(memfrac=0.001)
mem_info(in_data)
# ------------------------
# Memory (GB)
# ------------------------
# check threshold : 1 (memmin)
# available : 107.97
# allowed (0%) : 0.11
# needed (n=1) : 0.22
# ------------------------
# proc in memory : TRUE
# nr chunks : 1
# ------------------------
Applying the model on a single chunk, as below, will result in running out of memory:
lapp(in_data, f, filename='test.tif')
My question is: is there any way to force a specific chunk size in terra when we need a very small memfrac value?
Note that setting memmin and memmax in terraOptions does not help. Indeed, it appears that any value of memmin less than 1 GB is ignored.
terraOptions(memmax=0.01)
terraOptions(memmin=0.001)
terraOptions()
#tempdir : /tmp/RtmpTYLIGS
#todisk : FALSE
#memfrac : 0.001
#progress : 3
#verbose : FALSE
#memmin : 1
#tolerance : 0.1
#datatype : FLT4S
#memmax : 0.01
This is a bug; the "memmin" option was not captured. Thank you for reporting this (when you suspect a bug, the terra GitHub site is a better place to report it).
I believe this has now been fixed in terra 1.7-62, which should be available from R-Universe in an hour or so and can then be installed with
install.packages('terra', repos='https://rspatial.r-universe.dev')
"memmin" needs to be set here to lower the threshold for which checks are done.
I now see
terraOptions(memmax=0.01, memmin=0.01)
mem_info(in_data)
#------------------------
#Memory (GB)
#------------------------
#check threshold : 0.01 (memmin)
#available : 0.01 (memmax)
#allowed (60%) : 0.01
#needed (n=1) : 0.22
#------------------------
#proc in memory : FALSE
#nr chunks : 63
#------------------------
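With those lowered limits set, the lapp() call from the question should then be processed in many small chunks rather than in one piece. A sketch (assuming terra 1.7-62 or later, where the memmin fix is included):

# lower memmax/memmin so the memory check triggers chunked processing
# (assumes terra 1.7-62+, where the memmin option is honoured)
terraOptions(memmax = 0.01, memmin = 0.01)
out <- lapp(in_data, f, filename = 'test.tif', overwrite = TRUE)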
An alternative approach would be to use the "steps" option (and the "verbose" option to see what is going on inside); each step is a block of rows. This is not documented because it was intended for development/debugging, but that could change. Here I use the extreme case of one row at a time (it won't go lower).
f <- function(x, y, z) {
  x + y + z
}
x <- lapp(in_data, f, filename='test.tif', overwrite=TRUE,
          wopt=list(steps=nrow(in_data), verbose=TRUE))
#filename : test.tif
#compute stats : 1, GDAL: 0, minmax: 0, approx: 1
#driver : GTiff
#disk available: 176.9 GB
#disk needed : 0 GB
#memory avail. : 45.45 GB
#memory allow. : 27.27 GB
#memory needed : 0.298 GB (4 copies)
#in memory : true
#block size : 1000 rows
#n blocks : 1000
#pb : 3
(the reported "block size" is not correct because of the use of "steps", but "n blocks" is correct)
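For a memory-hungry function like the one in the question you do not have to go down to one row at a time; you can derive the number of steps from whatever chunk size the model can handle. A sketch, where f is the memory-hungry function defined in the question and 100 rows per chunk is just an illustrative guess, not a measured value:

# pick a chunk size the model can cope with (illustrative value only)
rows_per_chunk <- 100
n_steps <- ceiling(nrow(in_data) / rows_per_chunk)

# apply the memory-hungry function from the question in small row-blocks
out <- lapp(in_data, f, filename = 'test.tif', overwrite = TRUE,
            wopt = list(steps = n_steps, verbose = TRUE))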