Tags: r, time-series, raster, netcdf

How do I create a monthly series from hourly NetCDF data in R?


I am working on NEE (net ecosystem exchange) data stored in a NetCDF file, which I'll call NEE_2022.nc. The file is in my working directory. The spatial coverage is all of Europe at roughly 1 km resolution; the temporal coverage is the year 2022 at hourly resolution. Here is some more information about the data:

**1 variables** (excluding dimension variables):
        float NEE[lon,lat,time]   
            long_name: net ecosystem exchange
            units: micromoles/m2/s
            _FillValue: 1.00000001504747e+30
            missing_value: 1.00000001504747e+30
**3 dimensions**:
        time  Size:8760   *** is unlimited *** 
            standard_name: time
            long_name: time endpoint
            units: hours since 2022-01-01 00:00:00
            calendar: standard
            axis: T
        lon  Size:400 
            standard_name: longitude
            long_name: longitude
            units: degrees_east
            axis: X
        lat  Size:480 
            standard_name: latitude
            long_name: latitude
            units: degrees_north
            axis: Y

Here are my goals: I would like to average each pixel on a monthly basis, and I would also like to create a map for each monthly NEE raster. Spatial subsetting is not needed.

NB: the R.version command returns the following:

platform       x86_64-w64-mingw32               
arch           x86_64                           
os             mingw32                          
crt            ucrt                             
system         x86_64, mingw32                  
status                                          
major          4                                
minor          2.2                              
year           2022                             
month          10                               
day            31                               
svn rev        83211                            
language       R                                
version.string R version 4.2.2 (2022-10-31 ucrt)

I had issues opening the file. When I ran the following code:

library(ncdf4)
library(raster) 
library(rgdal) 
library(ggplot2) 
library(terra)

NEE_2022 <- nc_open('NEE_2022.nc',readunlim=FALSE)
NEE.array <- ncvar_get(NEE_2022, "NEE")

I ran into a memory issue and got the following error message:

Error: cannot allocate vector of size 12 gb

Then I successfully created a smaller array by playing with the "start" and "count" arguments of the ncvar_get() function. To select just the first day (24 hours), I did the following:

NEE.array <- ncvar_get(NEE_2022, "NEE", start = c(1,1,1), count = c(-1,-1,24))

How am I supposed to proceed?


Solution

  • Since the NEE file is too large to fit into memory, you have to process it in pieces.

    Solution 1: The obvious way

    Summing the hourly data into daily data is the most easily understood way forward, after which you sum the daily data into monthly data. (In your question you mention "average each pixel on a monthly basis", but I'm assuming you want to sum to "NEE per month" rather than average to "average hourly NEE per month".) The basic loop to do this looks like this:

    library(ncdf4)
    NEE_2022 <- nc_open('NEE_2022.nc')
    
    # Number of days per month in 2022
    days_in_month <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)
    
    # First hour of every month in 2022
    h1 <- c(1, 745, 1417, 2161, 2881, 3625, 4345, 5089, 5809, 6553, 7273, 8017)
    
    # Array for the result. ncvar_get() returns data in file-dimension
    # order [lon, lat, time], so the result array is [400, 480, 12]
    NEE_data <- array(0, dim = c(400, 480, 12))
    
    for (m in 1:12) {
        mon <- array(0, dim = c(400, 480))
        for (d in 1:days_in_month[m]) {
            data <- ncvar_get(NEE_2022, "NEE", 
                              start = c(1, 1, h1[m] + (d - 1) * 24), 
                              count = c(-1, -1, 24))
            # Sum the 24 hourly slices, converting the per-second fluxes
            # to hourly totals (3600 seconds per hour), and accumulate
            mon <- mon + rowSums(data, dims = 2) * 3600
        }
        NEE_data[, , m] <- mon
    }
    nc_close(NEE_2022)
    
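    The two hard-coded tables above (days_in_month and h1) can also be computed in base R, which avoids typing errors and adapts automatically to leap years. A small sketch; the variable names mirror the ones used above:

```r
# Compute days_in_month and h1 for any year instead of hard-coding them
year <- 2022
month_starts <- as.Date(sprintf("%d-%02d-01", year, 1:12))
next_starts  <- c(month_starts[-1], as.Date(sprintf("%d-01-01", year + 1)))
days_in_month <- as.integer(next_starts - month_starts)

# First hour of every month: cumulative days before the month, in hours, 1-based
h1 <- cumsum(c(0, head(days_in_month, -1))) * 24 + 1
```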

    You can also fold both loops into one, at the expense of much larger reads from the file, which may exhaust your memory:

    for (m in 1:12) {
        data <- ncvar_get(NEE_2022, "NEE", 
                          start = c(1, 1, h1[m]), 
                          count = c(-1, -1, 24 * days_in_month[m]))
        NEE_data[, , m] <- rowSums(data, dims = 2) * 3600
    }
    

    Solution 2: The R way

    In R you should frame your problem in terms of vectors, avoiding loops as much as possible. While here you cannot avoid loops entirely because of the large size of the file, you can still vectorise the processing logic. In this case that means creating a factor over the time dimension and then using it to extract all twelve monthly sums in one operation, done pixel by pixel due to the file size:

    # Create the factor: one month label for each hour of the year
    f <- as.factor(rep(rep(1:12, days_in_month), each = 24))
    
    # Loop through every pixel. The first file dimension is lon (400),
    # the second is lat (480)
    for (lon in 1:400) {
        for (lat in 1:480) {
            data <- ncvar_get(NEE_2022, "NEE", 
                              start = c(lon, lat, 1), 
                              count = c(1, 1, -1))
            NEE_data[lon, lat, ] <- tapply(data, f, sum) * 3600
        }
    }
    
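    To see what the factor is doing, here is a toy version of the same technique with three "months" of 2, 3, and 2 "days" and 2 "hours" per day (pure R, no file access):

```r
# Toy illustration of the factor/tapply technique
days  <- c(2, 3, 2)                        # days per "month"
f_toy <- as.factor(rep(rep(1:3, days), each = 2))
x     <- rep(1, 14)                        # one unit of flux per "hour"
tapply(x, f_toy, sum)                      # per-month totals: 4, 6, 4
```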

    This, however, is tediously slow due to the very large number of small reads. A more performant approach, which reduces the number of reads by reading and processing larger blocks, is to process an entire row of latitude in one go:

    for (lat in 1:480) {
        # All longitudes and all hours for one latitude: a [400, 8760] matrix
        data <- ncvar_get(NEE_2022, "NEE", 
                          start = c(1, lat, 1), 
                          count = c(-1, 1, -1))
        # apply() returns a [12, 400] matrix, so transpose before assigning
        NEE_data[, lat, ] <- t(apply(data, 1, tapply, f, sum)) * 3600
    }
    

    Potential pitfalls

    The above code should typically be placed inside a function so it can be stored and re-used easily. The problem is that it is highly dependent on the specifics of the file you are processing: if another file covers a different area of the globe, the dimensions will likely change, and if you process 2020 data there is a leap day to account for.
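    One way to remove the year-dependence is to derive the factor from the file's own time coordinate. The sketch below assumes time units of the form "hours since <origin>", as in this file; in real use, time_hours and units would come from ncvar_get(NEE_2022, "time") and ncatt_get(NEE_2022, "time", "units")$value. Note that this file's long_name says "time endpoint", so the hour values may run 1:8760 rather than 0:8759; check your file before relying on this.

```r
# Sketch: build a monthly factor from a CF-style "hours since" time
# coordinate, so leap years and other start dates come out right
monthly_factor <- function(time_hours, units) {
    origin <- sub("^hours since ", "", units)
    stamps <- as.POSIXct(origin, tz = "UTC") + time_hours * 3600
    as.factor(format(stamps, "%m"))
}

f <- monthly_factor(0:8759, "hours since 2022-01-01 00:00:00")
```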

    If the data file is compliant with the CF Metadata Conventions (which is very likely), you could look at the CFtime package: it can automate working with time series and will generate the factor used in the second solution by reading the details from the netCDF file itself.
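    Finally, as to the second goal, the monthly maps: once NEE_data is filled you can wrap it in a terra SpatRaster and plot one layer per month. This is a hedged sketch, assuming NEE_data is in [lon, lat, month] order as returned by ncvar_get() and that the file's lat coordinate ascends from south to north; the random stand-in array and the EPSG:4326 CRS are assumptions, and the extent should come from the actual values of ncvar_get(NEE_2022, "lon") and ncvar_get(NEE_2022, "lat"):

```r
library(terra)

# Stand-in for the real result; use the NEE_data computed above instead
NEE_data <- array(rnorm(400 * 480 * 12), dim = c(400, 480, 12))

# Reorder to [lat, lon, month] and flip so that row 1 is the northernmost
# latitude, which is the orientation terra expects
arr <- aperm(NEE_data, c(2, 1, 3))[480:1, , ]

r <- rast(arr)
crs(r) <- "EPSG:4326"
# ext(r) <- c(xmin, xmax, ymin, ymax)  # set the real extent from the lon/lat values

# One map per month
for (m in 1:12) {
    plot(r[[m]], main = sprintf("Monthly NEE, 2022-%02d", m))
}
```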