python, interpolation, geospatial, python-xarray

Downsampling geospatial xarrays too large for memory


I have a very large (86400 x 43200) xarray dataset with geospatial coordinates (lat, lon) and elevation, and I would like to downsample this map to one-twentieth of the resolution per dimension (4320 x 2160, which is equivalent to 1/12th of a degree).

At first I used the interp method directly, but it tries to load the whole dataset and therefore requests 28 GB of memory, which I can't allocate.

My current approach, which is detailed below:

  1. Create a new downsampled dataset and fill it with zeros
  2. Slice the large dataset into 36 equal slices
  3. Load and interpolate one slice at a time
  4. Add each downsampled slice to my downsampled dataset.

-> I get all NaN values. Step 2 works, but step 3 does not. How can I downsample such a large dataset, or populate my new dataset in the correct way?

import numpy as np
import xarray as xr
def format_spatial_resolution(
    xarray, res_lat=1 / 12, res_lon=1 / 12, sample_array=None
):

    new_lon = np.arange(xarray.lon[0], xarray.lon[-1], res_lon)
    new_lat = np.arange(xarray.lat[0], xarray.lat[-1], res_lat)
    xarray_new = xarray.interp(
        lat=new_lat,
        lon=new_lon,
        # Extrapolate to avoid nan values as edges
        kwargs={
            "fill_value": "extrapolate",
        },
    )
    return xarray_new


xarray = xr.open_dataset(filename)
# Setting resolution
res_lat = 1/12 # 1/12 degree
res_lon = 1/12
# Create a downsampled xarray with zero values
new_lon = np.arange(xarray.lon[0], xarray.lon[-1], res_lon)
new_lat = np.arange(xarray.lat[0], xarray.lat[-1], res_lat)

xarray_downsampled = xr.Dataset(
    dict(elevation=(["lat", "lon"], np.zeros((len(new_lat), len(new_lon))))),
    coords=dict(lon=new_lon, lat=new_lat),
)
# Interpolate each slice and update downsampled xarray
for i in range(-180, 180, 10):
    xarray_downsampled.update(format_spatial_resolution(xarray.sel(lon=slice(i, i + 10))))

Solution

  • I'd recommend using dask in combination with xarray's coarsen method.

    Start by chunking the data on read, using a chunking scheme in multiples of your coarsening window (so multiples of 20 on each dim) and targeting chunk sizes in the low 100s of MBs. Chunks of (8640 x 4320) float64 data are about 285 MB, and would drop to roughly 142 MB if converted to float32 - a reasonable size. Each chunk holds 1/100 of the total dataset, so we'll have 100 chunks in all:

    ds = xr.open_dataset(filename, chunks={"lat": 8640, "lon": 4320})
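
    If you want to confirm the layout before doing any real work, the chunking can be inspected without loading any data (an optional sanity check; it assumes the variable is named "elevation", as in the question):

    print(ds.chunks)             # chunk sizes along each dimension
    print(ds["elevation"].data)  # dask array repr: shape, dtype and chunk shape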
    

    Optionally, we can reduce the precision:

    ds["elevation"] = ds["elevation"].astype("float32")
    

    Finally, let's reduce the pixel density with coarsen:

    coarsened = ds.coarsen(lat=20, lon=20, boundary='exact').mean()
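
    As a quick optional check, the sizes of the output grid can be inspected to confirm the coarsening factor (this does not load any data):

    print(dict(coarsened.sizes))  # lat and lon should each be 1/20 of their original lengths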
    

    The result is a dask-backed array; no computation has been executed yet. To trigger it, we can either write the result to disk or compute it in memory with:

    coarsened = coarsened.compute()
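
    Alternatively, writing the lazy result straight to disk (instead of calling compute) triggers the same computation and streams it chunk by chunk, so the full-resolution array never has to fit in memory (a minimal sketch; the output filename is just an example):

    coarsened.to_netcdf("elevation_1_12_degree.nc")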