Tags: python, performance, python-xarray, fillna

Speeding up xarray's fillna


I have a rather large netCDF file (~10 GB) that has a fill value of -1.0.

When I use xarray's fillna like this:

hndl_nc = hndl_nc.fillna(0.0)

This is slow (~2 min). Is there another operation that might be faster? Or, given the size of the file, is this to be expected?


Solution

  • At ~85 MB/s (10 GB in about 2 minutes), this is in the ballpark of typical performance for vectorized NumPy/xarray operations. I think it's unlikely you could improve on this significantly by simply using another built-in operation.

    You might still be able to improve performance with some experimentation. The first thing to do is to profile and look at CPU usage to determine where the time is being spent.

    • If you're CPU bound in Python: try using Dask to parallelize operations, if you aren't using it already.
    • If you're CPU bound in the netCDF/HDF5 process: this is probably a symptom of netCDF4 files with in-file zlib compression (decompressing it is pretty slow). Either load your data into memory ahead of time (using .load()), rewrite your files without compression, or try using xarray v0.9.0 or newer (in release candidate at the time of writing) with dask.distributed or multiprocessing.
    • If you're IO bound, consider:
      • using engine='scipy', which can be faster if you have netCDF3 files
      • switching to scale_factor/add_offset to pack the data as int16 rather than storing larger float types
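
The suggestions above might be tried along these lines. This is only a sketch under stated assumptions: the synthetic dataset, the variable name `var`, the dimension name `time`, the chunk size, and all helper function names are made up for illustration and do not come from the question. The file helpers are defined but not called, since they need a real file (and Dask or SciPy installed, as noted in each docstring).

```python
import time

import numpy as np
import xarray as xr


def timed_fillna(ds, value=0.0):
    """Fill NaNs and return (result, seconds elapsed, throughput in MB/s)."""
    start = time.perf_counter()
    filled = ds.fillna(value).load()  # .load() forces any lazy work to run here
    elapsed = time.perf_counter() - start
    return filled, elapsed, ds.nbytes / 1e6 / elapsed


def open_lazy(path, chunk_size=100):
    """CPU bound in Python: open with Dask chunks (requires dask installed)."""
    # "time" is an assumed dimension name; use one that exists in your file.
    return xr.open_dataset(path, chunks={"time": chunk_size})


def open_eager(path):
    """CPU bound in netCDF/HDF5: decompress once by loading into memory."""
    # Only sensible if the decoded data fits in RAM.
    return xr.open_dataset(path).load()


def open_netcdf3(path):
    """IO bound with netCDF3 files: try the scipy backend (requires scipy)."""
    return xr.open_dataset(path, engine="scipy")


def write_packed(ds, path, name, scale_factor, add_offset=0.0):
    """IO bound: store one variable as int16 (lossy: precision is scale_factor)."""
    encoding = {name: {"dtype": "int16", "scale_factor": scale_factor,
                       "add_offset": add_offset, "_FillValue": -32768}}
    ds.to_netcdf(path, encoding=encoding)


# Small synthetic stand-in for the real file, with about 10% missing values.
raw = np.random.default_rng(0).random((50, 200, 200))
raw[raw < 0.1] = np.nan
ds = xr.Dataset({"var": (("time", "y", "x"), raw)})

filled, elapsed, mb_per_s = timed_fillna(ds)
print(f"fillna took {elapsed:.4f} s ({mb_per_s:.0f} MB/s)")
```

Timing the same call on a dataset returned by each helper should show which bottleneck applies. Note that int16 packing trades precision for size, so it only suits data that tolerates rounding to the chosen scale_factor.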