python gis python-xarray geotiff rasterio

memory error while processing large geotiffs on python

I have two raster image geotiffs, one of (275521, 329643) dims (let's call this the yearly image) and the other of (73583, 152367) dims (the monthly image). The cell sizes of both rasters are already the same. Using Python, I would like to do the following:

Clip the yearly image so that it is the same as the monthly image extents.
Match the extents of the yearly image to monthly image extents.
Create a binary mask from the yearly image file, which is a categorical dataset (vals 0/1/2/3) such that all cells containing 3 return a value of 1, and all other cells are 0.
Multiply the monthly image file with this new mask layer to filter out the cells that align with the yearly.data == 3 cells.

I have been working with rasterio and rioxarray in Python on Windows and would like to learn how to do this in Python because I will be looping through 8 yearly image files and for each of those years, 12 monthly image files.

Predictably, I have been running into memory errors, specifically when i try to create the mask from the yearly image,

MemoryError: Unable to allocate 10.4 GiB for an array with shape (1, 73583, 152367) and data type uint8

I understand that I should try and work with some kind of multiprocessing tool and have been trying to implement what I need using Dask but honestly have no idea what I'm doing or how to begin processing these arrays (i.e., creating a massive array [the yearly mask], multiplying this massive array with another massive array [monthly image] to create yet another massive array [masked monthly data]). Would appreciate all and any kind of help or advice, code examples, tutorials? many thanks.

Not sure how to provide the data but if it helps they are image files downloaded from google earth engine. Here's some sample code:

import rasterio as rio
import xarray as xr
import rioxarray
import numpy as np

years = np.arange(2012, 2020)
monthly_file_paths = ["\paths\to\monthly\image\tifs"]
yearly_file_paths = ["\paths\to\yearly\image\tifs"]
monthly_sample = xr.open_rasterio(monthly_file_paths[0], chunks={"x":256, "y":256})

for yearly_file in yearly_file_paths:
    yearly = xr.open_rasterio(yearly_file, chunks={"x":256, "y":256}) # !!!: Previously encountered memory error here before including the "chunks" bit
    yearly = yearly.rio.set_nodata(0)
    
    # crop yearly layer to monthly bounding box
    yearly = yearly.rio.clip_box(minx=monthly_sample.x.min().item(),
                    miny=monthly_sample.y.min().item(),
                    maxx=monthly_sample.x.max().item(),
                    maxy=monthly_sample.y.max().item())
    
    # reproject and resample yearly to match monthly extents and crs
    yearly = yearly.rio.reproject_match(monthly_sample )
    yearly = yearly.assign_coords({
            "x": monthly_sample.x,
            "y": monthly_sample.y})

    # create mask.
    yearly.data = xr.where(yearly==3, 1, 0) # !!!: Error occurs here

    for month_file in monthly_file_paths :
        _monthly = xr.open_rasterio(month_file)
        _monthly.data = xr.where(_monthly==2, 1, 0)
        yearly_monthly = yearly * _monthly
        yearly_monthly = yearly_monthly.rio.update_attrs(_monthly.attrs)
        yearly_monthly.rio.to_raster(f'{month_file}.tif', dtype=np.uint8,
                              compress='deflate')```

Solution

This is going to be a tough problem! Working with raster files this large is always gnarly. I don't have a silver bullet answer for you, but here are some things I'd consider.

Alignment & Chunking

I would start out by inspecting your files directly, and making sure your data is aligned precisely, e.g. with _monthly.x.isin(yearly.x.values).all() and the same for y.

Then, only read in those indices when opening the data, using da.sel.

Also, make sure you're chunking the _monthly file, not just the annual one:

_monthly = xr.open_rasterio(month_file)

As you do this, try to ensure that your chunks are aligned exactly (you don't want them offset so that multiple chunks need to be loaded in order to do each chunk's merge. xr.unify_chunks may be helpful here - I haven't used it but I think this is the intended use case.

Types

Another issues is that xarray assumes you always want 64-bit data types unless you specify. The line

 _monthly.data = xr.where(_monthly==2, 1, 0)

first converts _monthly==2 to type bool, then casts the python ints 1 and 0 to the same shape as your input array, and guesses that they should be 64-bit ints. This turns your 10GB problem into an 80GB one! See this simple example:

In [10]: data = xr.DataArray(np.array([1], dtype='uint8'))
In [11]: xr.where(data==1, 1, 0).dtype
Out[11]: dtype('int64')

Instead, be super pedantic about your types:

In [12]: one_u8 = np.array(1, dtype='uint8')
In [13]: zero_u8 = np.array(0, dtype='uint8')
In [14]: xr.where(data==1, one_u8, zero_u8).dtype
Out[14]: dtype('uint8')

Reprojecting

Finally, anything involving "reprojecting" or reshaping the data could really blow things up:

yearly = yearly.rio.reproject_match(monthly_sample)

My concerns with reprojecting are:

You need to make sure your coordinate labels are identical after reprojecting, not before. So make sure every single x and y is shared between the two arrays before merging, but after reprojecting. I know this is the goal of reproject_match, but this is the kind of thing that can go wrong when relying on float coordinates to match exactly. You might consider using an integer coordinate after reprojecting if you need to do this.
Does this load the data? What happens to the chunks? I think this is the kind of thing that should work; whether it works in practice when you're talking about a job this big is always another question. You're pushing xarray/rioxarray's limit when dealing with such large arrays, as a single load or casual float64 coersion could blow up your memory.

Possible ways forward

The good news is you're definitely on the right track. I would do this exactly how you're approaching this, with rioxarray, xarray, and dask. Step through the workflow line by line rather than trying to write this all out and debug it as it fails. Inspect the xarray objects at each step, looking closely at the dask chunks, sizes, and dtypes in the da.data object as you go. If its an option, prototype this in a jupyterlab notebook - the html interface for dask-backed xarray.DataArrays is really helpful.

The best advice I can offer is to not try to do this all in one go, at least not until you know you have each part of the workflow running smoothly. Instead, break up your workflow into steps, and write to disk after each one, reading from the intermediate file for the next step.

cut down your data to only include the needed shape
reproject. good luck :D
read in the data with chunking, and explicitly check for data alignment, e.g. with xr.align(..., join='exact')
mask & multiply

One thing that might be helpful is to use a file type that supports parallel I/O and chunking explicitly, such as zarr. If you swap out zarr for rasterio for your caching step, this may get a bit easier.