Tags: python, dask, python-xarray, grib, era5

Xarray / Dask - Compute the highest temperature for every coordinate


I have a 17 GB GRIB file containing temperature (t2m) data for every hour of the year 2020. The dimensions of the Dataset are longitude, latitude, and time.

My goal is to compute the highest temperature for every coordinate (lon, lat) in the data for the whole year. I can load the file fine using xarray, though it takes 4-5 minutes:

import xarray as xr
xarray_dataset = xr.open_dataset('cds/2020_hourly_t2m.grib', engine='cfgrib')

But calling xarray.Dataset.max() crashes the Google Colab session, probably because it requires more than the available memory.

So I probably need to use Dask to load the data in chunks, run the computation on those chunks, and aggregate the results. I'm new to Dask and am finding it difficult to read the climate dataset file in chunks using the dask.array API. I've tried dask.array.from_array(xarray_dataset.to_array()), but this crashes the session too.

My question is: how should I read this 17 GB GRIB file in chunks using Dask and compute the maximum temperature for the whole year for every (lon, lat) pair in the dataset?


Solution

  • xarray has Dask integration, which is activated when the chunks kwarg is provided. With it, xarray operates on the data chunk by chunk, which obviates the need to load the whole dataset into memory:

    import xarray as xr
    
    # chunks="auto" opens the file lazily, backed by Dask arrays
    ds = xr.open_dataset("cds/2020_hourly_t2m.grib", engine="cfgrib", chunks="auto")
    
    # reduce over time to get the annual maximum for every (lon, lat) pair
    test_lazy = ds.max(dim="time")  # this is lazy
    test_result = test_lazy.compute()  # actual result
    

    Note the need to call .compute() on the result of ds.max(dim="time"). Operating on chunks gives lazy results that are computed only when explicitly requested; see this tutorial.
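
  • If "auto" chunking still strains Colab's memory, you can choose the chunk sizes yourself and monitor the computation with Dask's progress bar. The following is a minimal sketch, assuming the variable is named t2m (as in the question) and that one-day chunks of 24 hourly steps fit comfortably in memory; tune the chunk size to your machine:

    import xarray as xr
    from dask.diagnostics import ProgressBar
    
    # chunk only along time: 24 hourly steps = one chunk per day (assumed size)
    ds = xr.open_dataset(
        "cds/2020_hourly_t2m.grib",
        engine="cfgrib",
        chunks={"time": 24},
    )
    
    t2m_max = ds["t2m"].max(dim="time")  # lazy annual maximum per (lon, lat)
    
    with ProgressBar():  # prints progress while Dask evaluates the task graph
        result = t2m_max.compute()

    Chunking only along time keeps each (lon, lat) location within a single chunk column, so the per-coordinate maximum reduces cleanly across chunks.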