python python-xarray zarr

Better way to identify chunks where data is available in zarr


I have a zarr store of weather data at a 1-hour time interval for the year 2022, so 8760 chunks. But there is data only for random days. How do I check for which of the 8760 hours data is available? The store is defined with "fill_value": "NaN".

I am iterating over each hour and checking whether all values are NaN, as below (using xarray), to identify whether there is data. But it is a very time-consuming process.

import numpy as np

hours = 8760
for hour in range(hours):
    # an hour has data if at least one value is not NaN
    if not np.isnan(np.array(xarrds['temperature'][hour])).all():
        print(f"data available in hour: {hour}")

Is there a better way to check the data availability?


Solution

  • Don't use an outer loop. Reduce over the non-time dimensions in one vectorized operation and let dask execute it in parallel:

    # assuming your data is already chunked along time, i.e. .chunk({'time': 1})
    da = xarrds['temperature']
    
    # get the names of non-time dims to reduce over
    non_time_dims = [d for d in da.dims if d != 'time']
    
    # create boolean DataArray indexed by time giving where array is all NaN
    all_null_by_hour = da.isnull().all(dim=non_time_dims)
    
    # compute the array
    all_null_by_hour = all_null_by_hour.compute()
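
To turn that boolean result into the list of hours that actually hold data, invert the mask and take the positions where it is True. Here is a minimal sketch on a small synthetic dataset. The 48-hour length, the `lat`/`lon` dimension names, and the three filled hours are made up for illustration and stand in for the real store. It runs on an in-memory array, so dask is not required to try it.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for the real store: 48 hourly steps on a 4x5 grid,
# all NaN except three hours that hold data.
n_hours = 48
values = np.full((n_hours, 4, 5), np.nan)
hours_with_data = [3, 17, 40]
for h in hours_with_data:
    values[h] = 280.0 + h

xarrds = xr.Dataset(
    {"temperature": (("time", "lat", "lon"), values)},
    coords={"time": np.arange(n_hours)},
)

da = xarrds["temperature"]
non_time_dims = [d for d in da.dims if d != "time"]

# True where the whole time slice is NaN (no data for that hour)
all_null_by_hour = da.isnull().all(dim=non_time_dims).compute()

# Integer positions along time where at least one value is present
available_hours = np.flatnonzero(~all_null_by_hour.values).tolist()
print(available_hours)  # [3, 17, 40]
```

If the time coordinate holds real timestamps, `da["time"].values[~all_null_by_hour.values]` gives the matching time labels instead of integer positions.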