Search code examples
image-processingtime-seriespython-xarray

How to use apply_ufunc with numpy.digitize for each image along time dimension of xarray.DataArray?


I've rephrased my earlier question substantially for clarity. Per Ryan's suggestion on a separate channel, numpy.digitize looks is the right tool for my goal.

I have of an xarray.DataArray of shape x, y, and time. I've trying to puzzle out what values I should supply to the apply_ufunc function's 'input_core_dims' and 'output_core_dims' arguments in order to apply numpy.digitize to each image in the time series.

Intuitively, I want the output dimensions to be ['time', 'x', 'y']. I think the input core dimensions should be x and y since I want to broadcast the numpy.digitize function along the time dimension. However this doesn't work. I have my correct result by applying numpy.digitize to the first numpy array in my time series:

[84]

blues
<xarray.DataArray 'reflectance' (time: 44, y: 1082, x: 1084)>
dask.array<shape=(44, 1082, 1084), dtype=uint16, chunksize=(44, 1082, 1084)>
Coordinates:
    band     int64 1
  * y        (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
  * x        (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05
  * time     (time) datetime64[ns] 2018-10-12 2018-10-16 ... 2019-05-26
Attributes:
    transform:   (3.0, 0.0, 488907.0, 0.0, -3.0, 970494.0)
    crs:         +init=epsg:32630
    res:         (3.0, 3.0)
    is_tiled:    1
    nodatavals:  (1.0, 1.0, 1.0, 1.0)
    scales:      (1.0, 1.0, 1.0, 1.0)
    offsets:     (0.0, 0.0, 0.0, 0.0)

[79]
#correct result
np.digitize(np.array(blues[0]), bin_arr)
array([[14, 15, 15, ..., 16, 17, 16],
       [14, 13, 14, ..., 16, 16, 15],
       [15, 14, 15, ..., 16, 16, 15],
       ...,
       [16, 18, 18, ..., 15, 16, 15],
       [17, 18, 18, ..., 16, 17, 16],
       [17, 17, 17, ..., 17, 18, 17]])

But my understanding of apply_ufunc is not correct. changing the input_core_dims to be [['x','y']] or ['time'] does not produce the correct digitized result

bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], dask="parallelized", output_dtypes=[blues.dtype])

#wrong values, correct shape
np.array(result)[0]

array([[14, 16, 15, ..., 48, 18, 15],
       [15, 16, 16, ..., 49, 18, 15],
       [15, 16, 16, ..., 49, 18, 14],
       ...,
       [16, 21, 17, ..., 50, 19, 15],
       [17, 21, 17, ..., 50, 19, 16],
       [16, 21, 18, ..., 50, 20, 17]])
bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['x','y']], dask="parallelized", output_dtypes=[blues.dtype])


#wrong values, correct shape
np.array(result)[0]

array([[14, 14, 15, ..., 16, 17, 17],
       [15, 13, 14, ..., 18, 18, 17],
       [15, 14, 15, ..., 18, 18, 17],
       ...,
       [16, 16, 16, ..., 15, 16, 17],
       [17, 16, 16, ..., 16, 17, 18],
       [16, 15, 15, ..., 15, 16, 17]])

Each of these results is of the correct shape but the wrong values, meaning the digitize function is being applied to the wrong axis and the result is reshaped to the shape of the input.

What's also strange is that the result of apply_ufunc drops the input_core_dim when displaying as an xarray. but internally, when you convert it to a numpy array, the dimension is still there

[85]

result
<xarray.DataArray 'reflectance' (y: 1082, x: 1084)>
dask.array<shape=(1082, 1084), dtype=uint16, chunksize=(1082, 1084)>
Coordinates:
    band     int64 1
  * y        (y) float64 9.705e+05 9.705e+05 9.705e+05 ... 9.673e+05 9.672e+05
  * x        (x) float64 4.889e+05 4.889e+05 4.889e+05 ... 4.922e+05 4.922e+05

[87]
# the shape of the xarray and numpy array do not match after apply_ufunc
np.array(result).shape
(1082, 1084, 44) 

additionally, when I try to specify the output_core_dims argument to be [['time', 'x', 'y']] to correct this, I get an error, it looks like you can't have a dimension be both an input core dimension and an output core dimension

[67]

bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], output_core_dims=[['time','x','y']], dask="parallelized", output_dtypes=[blues.dtype])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
 in 
      5 bin_arr = np.linspace(configs.rmin, configs.rmax, 50)
      6 blues = t_series['reflectance'].sel(band=1).chunk({'time':-1})
----> 7 result = xr.apply_ufunc(partial(np.digitize, bins=bin_arr), blues, input_core_dims=[['time']], output_core_dims=[['time','x','y']], dask="parallelized", output_dtypes=[blues.dtype])

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, *args)
    967                                      join=join,
    968                                      exclude_dims=exclude_dims,
--> 969                                      keep_attrs=keep_attrs)
    970     elif any(isinstance(a, Variable) for a in args):
    971         return variables_vfunc(*args)

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    215 
    216     data_vars = [getattr(a, 'variable', a) for a in args]
--> 217     result_var = func(*data_vars)
    218 
    219     if signature.num_outputs > 1:

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, output_sizes, keep_attrs, *args)
    539                   if isinstance(arg, Variable)
    540                   else arg
--> 541                   for arg, core_dims in zip(args, signature.input_core_dims)]
    542 
    543     if any(isinstance(array, dask_array_type) for array in input_data):

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in (.0)
    539                   if isinstance(arg, Variable)
    540                   else arg
--> 541                   for arg, core_dims in zip(args, signature.input_core_dims)]
    542 
    543     if any(isinstance(array, dask_array_type) for array in input_data):

~/miniconda3/envs/pyatsa/lib/python3.7/site-packages/xarray/core/computation.py in broadcast_compat_data(variable, broadcast_dims, core_dims)
    493                          'dimensions %r on an input variable: these are core '
    494                          'dimensions on other input or output variables'
--> 495                          % unexpected_dims)
    496 
    497     # for consistency with numpy, keep broadcast dimensions to the left

ValueError: operand to apply_ufunc encountered unexpected dimensions ['y', 'x'] on an input variable: these are core dimensions on other input or output variables

Any help is greatly appreciated, I'd like to understand how I'm misusing the input_core_dim and output_core_dim arguments.


Solution

  • You want to apply digitize on a point-by-point basis. This is the easiest possible use case for apply_ufunc. No special arguments are required.

    Numpy Version

    import numpy as np
    import xarray as xr
    
    ny, nx = 100, 100
    nt = 44
    data = xr.DataArray(np.random.randn(nt,ny,nx),
                            dims=['time', 'y', 'x'],
                            name='blue reflectance')
    
    rmin, rmax, nbins = -4, 4, 50
    bins = np.linspace(rmin, rmax, nbins)
    
    data_digitized = xr.apply_ufunc(np.digitize, data, bins)
    

    This returns a DataArray like

    <xarray.DataArray 'blue reflectance' (time: 44, y: 100, x: 100)>
    array([[[34, 17, ..., 27, 15],
             ....
            [21, 24, ..., 23, 29]]])
    Dimensions without coordinates: time, y, x
    

    where the values are the bin indexes, according to the conventions described in the numpy.digitize docs.

    Dask Version

    To make this operate lazily on dask arrays, you have two options

    # create chunked dask version of data
    data_chunked = data.chunk({'time': 1})
    
    # use dask's version of digitize
    import dask.array as da
    xr.apply_ufunc(da.digitize, data_chunked, bins, dask='allowed')
    
    # use xarray's built-in `parallelized` option on the numpy function
    # (I needed to define a wrapper function to make this work,
    # but I don't fully understand why.)
    def wrap_digitize(data):
        return np.digitize(data, bins)
    xr.apply_ufunc(wrap_digitize, data_chunked,
                   dask='parallelized', output_dtypes=['i8'])