python numpy numba python-xarray numpy-ufunc

Applying a Numba guvectorize function over time dimension of an 3D-Array with Xarray's apply_ufunc

I have some problems getting this to work properly and I'm also open to other suggestions as I'm not 100% sure if I'm going the right way with this.

Here is some simple dummy data:

times = pd.date_range(start='2012-01-01',freq='1W',periods=25)
x = np.array([range(0,20)]).squeeze()
y = np.array([range(20,40)]).squeeze()

data = np.random.randint(3, size=(25,20,20))

ds = xr.DataArray(data, dims=['time', 'y', 'x'], coords = {'time': times, 'y': y, 'x': x})

For each x,y-coordinate, I want to return the longest sequence of 1s or 2s over time. So my input array is 3D (time, x, y) and my output 2D (x, y). The code in 'seq_gufunc' is inspired by this thread. My actual dataset is much larger (with landuse classes instead of 1s, 2s, etc) and this is only a small part of a bigger workflow, where I'm also using dask for parallel processing. So in the end this should run fast and efficiently, which is why I ended up trying to figure out how to get numba's @guvectorize and Xarray's apply_ufunc to work together:


@guvectorize(
    "(int64[:], int64[:])",
    "(n) -> (n)", target='parallel', nopython=True
)
def seq_gufunc(x, out):

    f_arr = np.array([False])

    bool_stack = np.hstack((f_arr, (x == 1) | (x == 2), f_arr))

    # Get start, stop index pairs for sequences
    idx_pairs = np.where(np.diff(bool_stack))[0].reshape(-1, 2)

    # Get length of longest sequence
    longest_seq = np.max(np.diff(idx_pairs))

    out[:] = longest_seq


## Input for dim would be: 'time' 
def apply_seq_gufunc(data, dim):

    return xr.apply_ufunc(seq_gufunc,
                          data,
                          input_core_dims=[[dim]],
                          exclude_dims=set((dim,)), 
                          dask="allowed")

There are probably some very obvious mistakes that hopefully someone can point out. I have a hard time understanding what actually goes on in the background and how I should set up the layout-string of @guvectorize and the parameters of apply_ufunc so that it does what I want.

EDIT2: This is the working solution. See @OriolAbril 's answer for more information about the parameters of apply_ufunc and guvectorize. It was also necessary to implement the if...else... clause in case no values match and to avoid the ValueError that would be raised.

@guvectorize(
    "(int64[:], int64[:])",
    "(n) -> ()", nopython=True
)
def seq_gufunc(x, out):

    f_arr = np.array([False])
    bool_stack = np.hstack((f_arr, (x == 1) | (x == 2), f_arr))

    if np.sum(bool_stack) == 0:
        longest_seq = 0

    else:
        # Get start, stop index pairs for sequences
        idx_pairs = np.where(np.diff(bool_stack))[0].reshape(-1, 2)

        # Get length of longest sequence
        longest_seq = np.max(np.diff(idx_pairs))   

    out[:] = longest_seq


def apply_seq_gufunc(data, dim):

    return xr.apply_ufunc(seq_gufunc,
                          data,
                          input_core_dims=[[dim]],
                          dask="parallelized",
                          output_dtypes=['uint8']
                         )

Solution

I'd point you out to How to apply a xarray u_function over NetCDF and return a 2D-array (multiple new variables) to the DataSet, the immediate goal is not the same, but the detailed description and examples should clarify the issue.

In particular, you are right in using time as input_core_dims (in order to make sure it is moved to the last dimension) and it is correctly formatted as a list of lists, however, you do not need excluded_dims but output_core_dims==[["time"]].

The output has the same shape as the input, however, as explained in the link above, apply_ufunc expects it will have same shape as broadcasted dims. output_core_dims is needed to get apply_ufunc to expect output with dims y, x, time.