Tags: python, netcdf, python-xarray, netcdf4

Access data by month number in 3D xarray


I have data arrays (361x361) for Jan, Feb, March, Apr, Oct, Nov and Dec for a given year.

So far I've been storing them in individual netCDF files, one for every month in the year (e.g. 03.nc, 10.nc).

I'd like to combine all months into one netCDF file, so that I can do something like:

march_data = data.sel(month='03') 

or alternatively data.sel(month=3)

So far I've only been able to stack the monthly data in a 361x361x7 array, and it's unhelpfully indexed: to get March data you need data[:,:,2], and to get October it's data[:,:,4]. Clearly 2 and 4 do not intuitively correspond to March and October. This is partly because Python is indexed from zero and partly because I'm missing the summer months. I could put NaN fields in for the missing months, but that wouldn't solve the zero-indexing issue.

My attempt so far:

data = xarray.Dataset(
    data_vars={'ice_type': (['x', 'y', 'time'], year_array)},
    coords={'lon': (['x', 'y'], lon_target),
            'lat': (['x', 'y'], lat_target),
            'month_number': (['time'], month_int)})

Here year_array is a 361x361x7 numpy array, and month_int is a list that maps the third index of year_array to the month number: [1,2,3,4,10,11,12].

When I try to get Oct data with oct = data.sel(month_number=10) it throws an error.

On a side note, I'm aware that there's possibly a solution to be found here, but to be honest I don't understand how it works. My confusion is mostly based around how they use 'time' both as a dictionary key and list of times at the same time.


Solution

  • I think I've written a helper function to do something just like that:

    import xarray as xr


    def combine_new_ds_dim(ds_dict, new_dim_name):
        """
        Combines a dictionary of datasets along a new dimension using dictionary keys
        as the new coordinates.

        Parameters
        ----------
        ds_dict : dict
            Dictionary of xarray Datasets or DataArrays
        new_dim_name : str
            The name of the newly created dimension

        Returns
        -------
        xarray.Dataset or xarray.DataArray
            Merged Dataset or DataArray

        Raises
        ------
        ValueError
            If the values of the input dictionary were of an unrecognized type
        """

        expanded_dss = []

        for k, v in ds_dict.items():
            if not isinstance(v, (xr.Dataset, xr.DataArray)):
                raise ValueError(f"Unrecognized type for key {k!r}: {type(v)}")
            # Add a length-1 dimension, then label it with the dictionary key.
            expanded_dss.append(v.expand_dims(new_dim_name))
            expanded_dss[-1][new_dim_name] = [k]
        # Join the single-entry pieces along the new dimension.
        new_ds = xr.concat(expanded_dss, new_dim_name)

        return new_ds
    

    If you have all of the data in individual netCDF files, then you should be able to import them into individual DataArrays. Assuming you've done that, you could then do

    month_das = {
        1: january_da,
        2: february_da,
        # ...
        12: december_da
    }
    
    year_data = combine_new_ds_dim(month_das, 'month')
    

    which concatenates all of the data along the new dimension month, with the dictionary keys as its coordinates. I think the main loop of the function is easy enough to separate if you want to use that alone.
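
As a rough illustration of the loop used on its own, the sketch below applies it to the seven months from the question. The arrays are synthetic stand-ins (small constant grids rather than the real 361x361 fields), and the names `months`, `month_das` and `expanded` are made up for the example.

```python
import numpy as np
import xarray as xr

# Months available in the question; the summer months are missing.
months = [1, 2, 3, 4, 10, 11, 12]

# Synthetic stand-ins for the monthly fields: each array is filled with its
# own month number so the selection below is easy to check.
month_das = {
    m: xr.DataArray(np.full((4, 4), float(m)), dims=("x", "y"), name="ice_type")
    for m in months
}

# The main loop of the helper: give each array a length-1 "month" dimension,
# label it with the dictionary key, then concatenate the pieces.
expanded = []
for month, da in month_das.items():
    expanded.append(da.expand_dims("month").assign_coords(month=[month]))
year_data = xr.concat(expanded, dim="month")

# Label-based selection now uses the month number, not the array position.
october = year_data.sel(month=10)
```

Because the coordinate holds the actual month numbers, the gap between April and October does not matter: `sel` looks up the label, not the position.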

    EDIT:

    For anyone looking at this in the future, there's a much easier way of doing this with built-in xarray functions. You can just concatenate along a new dimension

    year_data = xr.concat([january_da, february_da, ..., december_da], dim="month")
    

    which will create a new DataArray with the constituent arrays concatenated along a new dimension, but without coordinates on that dimension. To add coordinates,

    year_data["month"] = [1, 2, ..., 12]
    

    at which point year_data will be concatenated along the new dimension "month" and will have the desired coordinates along that dimension, so year_data.sel(month=3) returns the March data.
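
A slightly shorter variant, offered as a sketch rather than as part of the answer above: `xr.concat` also accepts a named pandas Index as `dim`, which creates the new dimension and its coordinate in one call. The same idea appears to fix the attempt in the question. There, `month_number` is attached to a dimension called `time` but is not itself a dimension coordinate, which is most likely why `.sel(month_number=10)` fails. Naming the third dimension `month` and passing the month numbers as that dimension's coordinate makes label-based selection work. The data here are synthetic and the variable names are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

months = [1, 2, 3, 4, 10, 11, 12]

# Synthetic monthly fields (small grids standing in for the 361x361 arrays).
monthly = [xr.DataArray(np.full((3, 3), float(m)), dims=("x", "y")) for m in months]

# A named pandas Index creates the dimension and its coordinate together.
year_data = xr.concat(monthly, dim=pd.Index(months, name="month"))
march = year_data.sel(month=3)

# Starting instead from an already stacked (x, y, month) array, as in the question:
year_array = np.stack([da.values for da in monthly], axis=-1)
data = xr.Dataset(
    data_vars={"ice_type": (["x", "y", "month"], year_array)},
    # A coordinate with the same name as the dimension becomes its index.
    coords={"month": months},
)
october = data["ice_type"].sel(month=10)
```

Once the months are combined this way, `year_data.to_netcdf(...)` or `data.to_netcdf(...)` should write everything to a single file, and the `month` coordinate is stored along with it.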