I have a codebase that makes heavy use of the xarray package. The computations (mostly pointwise arithmetic, dot products, and built-in numpy ufuncs) are already heavily vectorized. I was looking into numba to speed this code up further. One reason is that the code is clearly not running in parallel (only one core is used), so I thought numba's `@jit(parallel=True)` decorator could help. (As far as I have tried, it doesn't.) Whenever I try to use `@jit(nopython=True)`, exceptions are raised, so I guess numba cannot handle the underlying xarray objects. A minimal reproduction of the kind of failure I see is below.
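(The function and array shapes here are just placeholders for my actual computation.)

```python
import numpy as np
import xarray as xr
from numba import jit

da = xr.DataArray(np.random.rand(1000, 1000), dims=("x", "y"))

@jit(nopython=True)
def scaled_sum(a, b):
    # placeholder for the actual pointwise arithmetic
    return 2.0 * a + b

# Fails with a TypingError: numba cannot infer a type for DataArray arguments
scaled_sum(da, da)
```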
So: if the critical parts of the code you want to speed up can be cast as (generalized) ufuncs and compiled with `numba.vectorize` or `numba.guvectorize`, you can then use `xarray.apply_ufunc` to apply them to your xarray data, with dimensions and broadcasting taken care of automatically. A sketch of this pattern follows below.
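For example, here is a minimal sketch of that pattern. The demeaning operation, dimension names, and array shapes are made up for illustration; `target="parallel"` asks numba to thread the outer loop over the broadcast dimensions, which addresses the single-core issue:

```python
import numpy as np
import xarray as xr
from numba import guvectorize, float64

# Compile a generalized ufunc that removes the mean along one core
# dimension; the "(n)->(n)" layout tells numba and NumPy that the
# function consumes and produces one axis of length n.
@guvectorize([(float64[:], float64[:])], "(n)->(n)",
             nopython=True, target="parallel")
def demean(a, out):
    m = 0.0
    for i in range(a.shape[0]):
        m += a[i]
    m /= a.shape[0]
    for i in range(a.shape[0]):
        out[i] = a[i] - m

da = xr.DataArray(np.random.rand(500, 100), dims=("time", "space"))

# apply_ufunc hands the core dimension "space" to the compiled gufunc;
# broadcasting over "time" is handled by xarray/NumPy automatically.
result = xr.apply_ufunc(demean, da,
                        input_core_dims=[["space"]],
                        output_core_dims=[["space"]])
```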
Depending on your code, it might also help, or even be easier and more useful, to parallelize and optimize the computations via xarray's Dask interface. In many cases it can be as simple as calling the `.chunk()` method, performing the necessary operations, and finally calling the `.compute()` method at the end; see the sketch below.
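Something along these lines (the shapes, chunk size, and operations are made up for illustration; this requires the `dask` package to be installed):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.random.rand(4000, 4000), dims=("x", "y"))

# Split the array into blocks along "x"; subsequent operations build a
# lazy Dask task graph instead of executing eagerly.
chunked = da.chunk({"x": 1000})

# Pointwise arithmetic, ufuncs, and reductions work unchanged.
lazy = (np.sin(chunked) ** 2).mean(dim="y")

# Trigger the actual computation; the blocks are processed in parallel
# by Dask's threaded scheduler.
result = lazy.compute()
```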
To answer the first part of the question: yes, I use a combination of these approaches in my projects; see here for a real-world example.