I am working on NEE (net ecosystem exchange) data stored in a NetCDF file, which I'll call NEE_2022.nc. The file is in my working directory. The spatial coverage is all of Europe at roughly 1 km resolution; the temporal coverage is the year 2022 at hourly resolution. Here is some more info about the data:
**1 variables** (excluding dimension variables):
float NEE[lon,lat,time]
long_name: net ecosystem exchange
units: micromoles/m2/s
_FillValue: 1.00000001504747e+30
missing_value: 1.00000001504747e+30
**3 dimensions**:
time Size:8760 *** is unlimited ***
standard_name: time
long_name: time endpoint
units: hours since 2022-01-01 00:00:00
calendar: standard
axis: T
lon Size:400
standard_name: longitude
long_name: longitude
units: degrees_east
axis: X
lat Size:480
standard_name: latitude
long_name: latitude
units: degrees_north
axis: Y
Here are my goals: I would like to average each pixel on a monthly basis, and I would also like to create a map for each monthly NEE raster. Spatial subsetting is not needed.
NB the R.version command returns the following text:
platform x86_64-w64-mingw32
arch x86_64
os mingw32
crt ucrt
system x86_64, mingw32
status
major 4
minor 2.2
year 2022
month 10
day 31
svn rev 83211
language R
version.string R version 4.2.2 (2022-10-31 ucrt)
I had issues reading the data. When I run the following code:
library(ncdf4)
library(raster)
library(rgdal)
library(ggplot2)
library(terra)
NEE_2022 <- nc_open('NEE_2022.nc',readunlim=FALSE)
NEE.array <- ncvar_get(NEE_2022, "NEE")
I ran into a memory issue and got the following error message:
Error: cannot allocate vector of size 12 Gb
Then I successfully created an array by playing with the "start" and "count" arguments of the ncvar_get() function, selecting just the first day (24 hours):
NEE.array <- ncvar_get(NEE_2022, "NEE", start = c(1,1,1), count = c(-1,-1,24))
How am I supposed to proceed?
Since the NEE file is too large to fit into memory, you have to process it in pieces. (A quick sanity check: the full array is 400 × 480 × 8760 double-precision values, or about 12.5 GB, which matches your allocation error.)
Solution 1: The obvious way
Summing the hourly data into daily data is the most easily understood way forward; after that you sum the daily data into monthly data. (In your question you mention "average each pixel on a monthly basis", but I'm assuming you want to sum to "NEE per month" rather than average to "average hourly NEE per month".) Note that ncvar_get() returns data in the file's [lon, lat, time] dimension order, so that order is used throughout. The basic loop to do this looks like this:
library(ncdf4)
NEE_2022 <- nc_open('NEE_2022.nc')
# Number of days per month
days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
# First hour of every month in 2022 (1-based index into the time dimension)
h1 <- cumsum(c(1, days_in_month[-12] * 24))
# Array for the result, in [lon, lat, month] order
NEE_data <- array(0, dim = c(400, 480, 12))
for (m in 1:12) {
  mon <- array(0, dim = c(400, 480))
  for (d in 1:days_in_month[m]) {
    data <- ncvar_get(NEE_2022, "NEE",
                      start = c(1, 1, h1[m] + (d - 1) * 24),
                      count = c(-1, -1, 24))
    # Sum the 24 hourly slices; multiplying by 3600 integrates
    # micromoles/m2/s over each hour, giving micromoles/m2
    mon <- mon + rowSums(data, dims = 2) * 3600
  }
  NEE_data[, , m] <- mon
}
nc_close(NEE_2022)
You can also fold both loops into one, at the expense of much larger reads from the file (a 31-day month is 400 × 480 × 744 values, about 1.1 GB as doubles), which may exhaust your memory:
for (m in 1:12) {
  data <- ncvar_get(NEE_2022, "NEE",
                    start = c(1, 1, h1[m]),
                    count = c(-1, -1, 24 * days_in_month[m]))
  NEE_data[, , m] <- rowSums(data, dims = 2) * 3600
}
Solution 2: The R way
In R you should frame your problem in terms of vectors, avoiding explicit loops as much as possible. While you cannot avoid loops entirely here because of the size of the file, you can still vectorise the processing logic. In this case that means creating a factor over the time dimension and then using it to extract all twelve monthly sums in one operation, pixel by pixel due to the file size:
# Create the factor: the month number for each of the 8760 hourly steps
f <- as.factor(rep(rep(1:12, days_in_month), each = 24))
# Loop through every pixel, in the file's [lon, lat, time] order
for (lat in 1:480) {
  for (lon in 1:400) {
    data <- ncvar_get(NEE_2022, "NEE",
                      start = c(lon, lat, 1),
                      count = c(1, 1, -1))
    NEE_data[lon, lat, ] <- tapply(data, f, sum) * 3600
  }
}
This, however, is tediously slow due to the very large number of reads (400 × 480 = 192,000 separate calls). A more performant way, reducing the number of reads by processing larger blocks, is to read an entire row of latitude in one go:
for (lat in 1:480) {
  data <- ncvar_get(NEE_2022, "NEE",
                    start = c(1, lat, 1),
                    count = c(-1, 1, -1))
  # "data" is a 400 x 8760 matrix; tapply() over each row yields 12
  # monthly sums, returned as a 12 x 400 matrix, hence the t()
  NEE_data[, lat, ] <- t(apply(data, 1, tapply, f, sum)) * 3600
}
Potential pitfalls
The above code should typically be placed inside a function so it can be stored and re-used easily. The problem is that the code is highly dependent on the specifics of the file being processed: if another file covers a different area of the globe, the dimensions will likely change, and if you process 2020 data there is a leap day that you must consider. Also be aware that ncvar_get() replaces the _FillValue with NA, so pixels without data (e.g. over the ocean) will yield NA monthly sums. A sketch of such a function follows.
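This is a minimal sketch (the function name monthly_nee is mine; it assumes a full calendar year of hourly data in [lon, lat, time] order, with dimensions named lon and lat as in your file):
library(ncdf4)

# Sum an hourly [lon, lat, time] netCDF variable into monthly totals.
# "year" is only used to decide whether February has 28 or 29 days.
monthly_nee <- function(filename, varname = "NEE", year = 2022) {
  nc <- nc_open(filename, readunlim = FALSE)
  on.exit(nc_close(nc))

  nlon <- nc$dim$lon$len
  nlat <- nc$dim$lat$len

  # Leap-year aware month lengths
  feb <- if ((year %% 4 == 0 && year %% 100 != 0) || year %% 400 == 0) 29 else 28
  days_in_month <- c(31, feb, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
  h1 <- cumsum(c(1, days_in_month[-12] * 24))

  out <- array(0, dim = c(nlon, nlat, 12))
  for (m in 1:12) {
    # Month-sized reads, as in the folded loop of Solution 1
    data <- ncvar_get(nc, varname,
                      start = c(1, 1, h1[m]),
                      count = c(-1, -1, 24 * days_in_month[m]))
    out[, , m] <- rowSums(data, dims = 2) * 3600
  }
  out
}

NEE_data <- monthly_nee("NEE_2022.nc")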
If the data file is compliant with the CF Metadata Conventions (which is very likely), you could also look at the CFtime package: it automates working with netCDF time series and can generate the factor used in the second solution directly from the details in the netCDF file.
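If you prefer not to add a dependency, you can derive the factor from the file's own time axis in base R, which also handles leap years. A sketch, assuming the units are "hours since 2022-01-01 00:00:00" as in your file (for another file you would parse the units attribute, e.g. with ncatt_get(nc, "time", "units"), instead of hard-coding the origin):
hours <- ncvar_get(NEE_2022, "time")
origin <- as.POSIXct("2022-01-01 00:00:00", tz = "UTC")

# Your time long_name says "time endpoint", so shift back half an hour to
# label each hourly interval by its midpoint before extracting the month;
# check that your first time value is indeed 1 rather than 0
stamps <- origin + (hours - 0.5) * 3600

# Factor with levels "01".."12", usable with tapply() exactly as above
f <- as.factor(format(stamps, "%m"))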