Search code examples
pythonpython-xarray

Preserve coordinates in xarray.Dataset when subsetting a DataArray


I want to have multiple coordinate systems for a given dimension, e.g., a single "time" dimension with coordinates for (the default) datetime64[ns] type but also for a numeric year, or a season. This seems to be possible using Dataset.assign_coords() but has some unexpected behavior when subsetting a DataArray within the Dataset.

I'm using the publicly available NASA dataset available here (with Earthdata login); described in more detail at this link.

import xarray as xr
import numpy as np

ds = xr.open_mfdataset('GRCTellus.JPL.200204_202311.GLO.RL06.1M.MSCNv03CRI.nc')
ds.coords['lon'] = ds.coords['lon'] - 180

# Get a list of all the dates from the "time" coordinate
dates = ds.coords['time'].values

# Convert each "datetime64[ns]" object to a 4-digit year
years = []
for each in dates:
    years.append(int(str(each)[0:4]))

# Create a new set of coordinates
ds = ds.assign_coords(year = ('year', years))
<xarray.Dataset>
Dimensions:        (lat: 360, time: 227, lon: 720, bounds: 2, year: 227)
Coordinates:
  * lat            (lat) float64 -89.75 -89.25 -88.75 ... 88.75 89.25 89.75
  * time           (time) datetime64[ns] 2002-04-17T12:00:00 ... 2023-11-16
  * lon            (lon) float64 -179.8 -179.2 -178.8 ... 178.8 179.2 179.8
  * year           (year) int64 2002 2002 2002 2002 2002 ... 2023 2023 2023 2023

This works great, so far, in that I have a year coordinate now. I also have a year dimension listed, which is not what I wanted.

The more serious problem is that when I select a DataArray from this Dataset, it doesn't know anything about the new coordinates.

ds['lwe_thickness'].sel(lon = slice(-124, -114), lat = slice(32, 42))
<xarray.DataArray 'lwe_thickness' (time: 227, lat: 20, lon: 20)>
dask.array<getitem, shape=(227, 20, 20), dtype=float64, chunksize=(46, 20, 20), chunktype=numpy.ndarray>
Coordinates:
  * lat      (lat) float64 32.25 32.75 33.25 33.75 ... 40.25 40.75 41.25 41.75
  * time     (time) datetime64[ns] 2002-04-17T12:00:00 ... 2023-11-16
  * lon      (lon) float64 -123.8 -123.2 -122.8 -122.2 ... -115.2 -114.8 -114.2

The result is that "time" is listed as a coordinate, but not "year". This has implications for using DataArray.polyfit(), because without a "year" coordinate I can't get coefficient values denominated by year.

How can I make a new coordinate system available to each DataArray in my Dataset? I have looked at:


Solution

  • You are telling it to connect the new coordinate to a newly created dimension 'year'. I didn't try this on your exact data, but try connecting the coordinate 'year' to the existing dimension 'time'

     ds = ds.assign_coords(year = ('time', years))
    

    Then the 'year' coordinate should also show up in the DataArrays that are defined over the dimension 'time'.