Tags: python, geospatial, netcdf, python-xarray

Writing an xarray dataset is much slower than reading it?


I use xarray's open_dataset to open about 4000 NetCDF files, and I'd like to save the dataset without any processing. While reading the 4000 files takes about a minute (no lazy reading), writing it can take hours. The total size of the directory containing the files is about 750 MB. Here I compare reading and writing individual files using xr.open_dataset against reading and writing via xr.open_mfdataset:

Reading individual files:

ds = xr.open_dataset('path to file')
CPU times: user 7.95 ms, sys: 0 ns, total: 7.95 ms
Wall time: 6.95 ms

Writing individual files:

ds.to_netcdf('path to output')
CPU times: user 76.3 ms, sys: 4.01 ms, total: 80.4 ms
Wall time: 78.4 ms

Reading all the files with open_mfdataset:

ds_all = xr.open_mfdataset('path to files/*.nc')
CPU times: user 1min 3s, sys: 880 ms, total: 1min 4s
Wall time: 1min 4s

Selecting one of the files:

ds_sel = ds_all.isel(time=1)
CPU times: user 20.2 ms, sys: 5 µs, total: 20.2 ms
Wall time: 19.6 ms

Writing the selected dataset:

ds_sel.to_netcdf('path to output')
CPU times: user 3.89 s, sys: 11.9 ms, total: 3.9 s
Wall time: 3.89 s

As you can see, writing each file of the dataset takes ~4 seconds, meaning that writing all 4000 files would take roughly 4.5 hours. Compare those 4 seconds with the 78 ms it takes to write an individual file opened with open_dataset.

What am I doing wrong here?

By the way, here is what each file looks like:

[screenshot of the dataset's structure omitted]


Solution

  • There are a number of reasons why opening datasets in xarray may be faster than writing:

    • xarray.open_dataset uses lazy loading with netCDFs. This is the default behavior of the NetCDF library (as well as a number of other backends such as zarr); xarray just exposes it. See the User Guide section on reading NetCDFs:

      Data is always loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation. For an example of how these lazy arrays work, see the OPeNDAP section below.

    • xarray.open_dataset also caches data by default. From the API documentation:

      cache (bool, optional) – If True, cache data loaded from the underlying datastore in memory as NumPy arrays when accessed to avoid reading from the underlying datastore multiple times. Defaults to True unless you specify the chunks argument to use dask, in which case it defaults to False. Does not change the behavior of coordinates corresponding to dimensions, which always load their data from disk into a pandas.Index.

    • xarray.open_mfdataset uses dask, meaning that any non-blocking operations you perform (slicing, most computation, many reshape operations, etc.) will only schedule the tasks, not execute them. So if you're not including the execution time in your profiling, you're only capturing the time it takes dask to infer the shapes and types of the arrays involved, plus the time it takes to schedule the subsequent operations (see the sketch below).

    xarray.Dataset.to_netcdf on the other hand is a blocking operation, and the compute time will reflect the full time to complete any scheduled dask operations, to load any lazy-loaded data which has not yet been pulled into memory, and to write the full array(s).
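
    To separate these effects in a profile, here is a quick sketch (using placeholder paths like the ones above) that forces the lazy reads and the scheduled dask tasks to execute before timing the write, so that to_netcdf measures only the write itself:

    import xarray as xr

    # open_dataset is fast because only metadata is read at this point
    # (cache=True is the default here; pass cache=False to re-read on every access)
    ds = xr.open_dataset('path to file')
    ds.load()                                    # force the array values into memory

    # open_mfdataset builds a lazy dask-backed dataset; isel only schedules work
    ds_all = xr.open_mfdataset('path to files/*.nc')
    ds_sel = ds_all.isel(time=1)
    ds_sel = ds_sel.compute()                    # execute the scheduled read/selection

    ds_sel.to_netcdf('path to output')           # now this times only the write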

    Note also that xarray can only write netCDFs in serial, so even if you use dask for your operations, all of your data must flow through the single main thread before it can be written. Zarr is a similar storage format which allows distributed writes, getting around this bottleneck for distributed systems. This is usually less of a concern if you're working on a local machine, which is likely constrained by write speeds, but on networked storage or on a cluster it can become a real performance factor.
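
    If distributed writes would help, a minimal sketch might look like the following (the output path is a placeholder, and it assumes the zarr package is installed):

    import xarray as xr

    ds_all = xr.open_mfdataset('path to files/*.nc')

    # zarr stores each chunk as a separate object, so dask workers can
    # write chunks in parallel instead of funneling through one thread
    ds_all.to_zarr('path to output.zarr')

    The result can then be read back lazily with xarray.open_zarr.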