I have two raster image GeoTIFFs, one of (275521, 329643) dims (let's call this the yearly image) and the other of (73583, 152367) dims (the monthly image). The cell sizes of both rasters are already the same. Using Python, I would like to do the following: (1) create a binary mask from the yearly image, (2) multiply that mask with each monthly image to produce masked monthly data, and (3) write each result to disk.
I have been working with rasterio and rioxarray on Windows, and I would like to learn how to do this in Python because I will be looping through 8 yearly image files and, for each year, 12 monthly image files.
Predictably, I have been running into memory errors, specifically when I try to create the mask from the yearly image:

```
MemoryError: Unable to allocate 10.4 GiB for an array with shape (1, 73583, 152367) and data type uint8
```
I understand that I should work with some kind of multiprocessing tool, and I have been trying to implement what I need using Dask, but honestly I have no idea what I'm doing or how to begin processing these arrays (i.e., creating a massive array [the yearly mask], multiplying this massive array with another massive array [the monthly image] to create yet another massive array [the masked monthly data]). I would appreciate any and all help or advice, code examples, tutorials. Many thanks.
Not sure how to provide the data, but if it helps, they are image files downloaded from Google Earth Engine. Here's some sample code:
```python
import rasterio as rio
import xarray as xr
import rioxarray
import numpy as np

years = np.arange(2012, 2020)
monthly_file_paths = [r"\paths\to\monthly\image\tifs"]
yearly_file_paths = [r"\paths\to\yearly\image\tifs"]

monthly_sample = xr.open_rasterio(monthly_file_paths[0], chunks={"x": 256, "y": 256})

for yearly_file in yearly_file_paths:
    yearly = xr.open_rasterio(yearly_file, chunks={"x": 256, "y": 256})  # !!!: Previously encountered memory error here before including the "chunks" bit
    yearly = yearly.rio.set_nodata(0)

    # crop yearly layer to monthly bounding box
    yearly = yearly.rio.clip_box(minx=monthly_sample.x.min().item(),
                                 miny=monthly_sample.y.min().item(),
                                 maxx=monthly_sample.x.max().item(),
                                 maxy=monthly_sample.y.max().item())

    # reproject and resample yearly to match monthly extents and crs
    yearly = yearly.rio.reproject_match(monthly_sample)
    yearly = yearly.assign_coords({"x": monthly_sample.x,
                                   "y": monthly_sample.y})

    # create mask
    yearly.data = xr.where(yearly == 3, 1, 0)  # !!!: Error occurs here

    for month_file in monthly_file_paths:
        _monthly = xr.open_rasterio(month_file)
        _monthly.data = xr.where(_monthly == 2, 1, 0)
        yearly_monthly = yearly * _monthly
        yearly_monthly = yearly_monthly.rio.update_attrs(_monthly.attrs)
        yearly_monthly.rio.to_raster(f'{month_file}.tif', dtype=np.uint8,
                                     compress='deflate')
```
This is going to be a tough problem! Working with raster files this large is always gnarly. I don't have a silver bullet answer for you, but here are some things I'd consider.
I would start out by inspecting your files directly and making sure your data is aligned precisely, e.g. with `_monthly.x.isin(yearly.x.values).all()` and the same for `y`. Then read in only those indices when opening the data, using `da.sel`.
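To make that concrete, here is a toy sketch on synthetic grids (the coordinates and values are made up; in practice `yearly` and `monthly` would be your opened rasters):

```python
import numpy as np
import xarray as xr

# Toy stand-ins: a coarse "yearly" grid and a "monthly" grid that is an
# exact subset of it (y descending, as is typical for rasters)
yearly = xr.DataArray(
    np.zeros((6, 8), dtype="uint8"),
    coords={"y": np.arange(60, 0, -10), "x": np.arange(0, 80, 10)},
    dims=("y", "x"),
)
monthly = yearly.sel(y=slice(50, 20), x=slice(10, 50))

# Check that every monthly coordinate exists in the yearly grid
assert monthly.x.isin(yearly.x.values).all()
assert monthly.y.isin(yearly.y.values).all()

# Read only the overlapping window from the larger raster
yearly_window = yearly.sel(x=monthly.x, y=monthly.y)
print(yearly_window.shape)  # same grid as monthly
```

If the `isin` checks fail, the grids are not exactly aligned and a selection like this would silently be wrong, so it's worth doing the check first.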
Also, make sure you're chunking the `_monthly` file, not just the annual one:

```python
_monthly = xr.open_rasterio(month_file, chunks={"x": 256, "y": 256})
```
As you do this, try to ensure that your chunks are aligned exactly (you don't want them offset so that multiple chunks need to be loaded in order to do each chunk's merge). `xr.unify_chunks` may be helpful here - I haven't used it, but I think this is the intended use case.
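For what it's worth, here's a minimal sketch of `xr.unify_chunks` on two made-up dask-backed arrays with mismatched chunking (requires dask; shapes and chunk sizes are illustrative):

```python
import numpy as np
import xarray as xr

# Two arrays on the same grid but chunked differently
a = xr.DataArray(np.zeros((100, 100)), dims=("y", "x")).chunk({"y": 30, "x": 30})
b = xr.DataArray(np.ones((100, 100)), dims=("y", "x")).chunk({"y": 25, "x": 50})

# unify_chunks rechunks both arrays onto a common chunk grid, so an
# elementwise operation between them maps one chunk to one chunk
a_u, b_u = xr.unify_chunks(a, b)
print(a_u.chunks == b_u.chunks)  # True
```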
Another issue is that xarray assumes you always want 64-bit data types unless you specify otherwise. The line

```python
_monthly.data = xr.where(_monthly==2, 1, 0)
```

first converts `_monthly==2` to type `bool`, then broadcasts the Python `int`s `1` and `0` to the shape of your input array, and guesses that they should be 64-bit ints. This turns your 10GB problem into an 80GB one! See this simple example:
```python
In [10]: data = xr.DataArray(np.array([1], dtype='uint8'))

In [11]: xr.where(data==1, 1, 0).dtype
Out[11]: dtype('int64')
```
Instead, be super pedantic about your types:
```python
In [12]: one_u8 = np.array(1, dtype='uint8')

In [13]: zero_u8 = np.array(0, dtype='uint8')

In [14]: xr.where(data==1, one_u8, zero_u8).dtype
Out[14]: dtype('uint8')
```
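Equivalently, since the mask is just 0s and 1s anyway, you can cast the boolean comparison directly and the 64-bit intermediates never appear; a small sketch:

```python
import numpy as np
import xarray as xr

data = xr.DataArray(np.array([1, 3, 2], dtype='uint8'))

# The boolean array cast straight to uint8 gives the same 0/1 mask
mask = (data == 3).astype('uint8')
print(mask.dtype)   # uint8
print(mask.values)  # [0 1 0]
```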
Finally, anything involving "reprojecting" or reshaping the data could really blow things up:

```python
yearly = yearly.rio.reproject_match(monthly_sample)
```

My concerns with reprojecting are:

- as far as I know, rioxarray's reprojection is not a lazy, chunk-by-chunk operation, so it can pull far more of the array into memory than your chunked reads would suggest;
- resampling onto a new grid can leave coordinates that are only approximately equal to the target's (presumably why you follow it with `assign_coords`), which quietly breaks the exact alignment everything else depends on.

Since your cell sizes already match, it's worth checking whether you can avoid `reproject_match` entirely and just select the overlapping window.
The good news is you're definitely on the right track. I would do this exactly how you're approaching it, with rioxarray, xarray, and dask. Step through the workflow line by line rather than trying to write it all out and debug it as it fails. Inspect the xarray objects at each step, looking closely at the dask chunks, sizes, and dtypes in the `da.data` object as you go. If it's an option, prototype this in a JupyterLab notebook - the HTML repr for dask-backed `xarray.DataArray`s is really helpful.
The best advice I can offer is to not try to do this all in one go, at least not until you know you have each part of the workflow running smoothly. Instead, break up your workflow into steps, and write to disk after each one, reading from the intermediate file for the next step.
When you read an intermediate back in, you can verify it still lines up with your other inputs using `xr.align(..., join='exact')`, which raises an error instead of silently reindexing if the coordinates don't match.
One thing that might be helpful is to use a file type that supports parallel I/O and chunking explicitly, such as zarr. If you swap in zarr for rasterio in your caching step, this may get a bit easier.