Here is an MWE for resampling a time series in xarray vs. pandas. The 10Min resample takes 6.8 seconds in xarray and 0.003 seconds in pandas. Is there some way to get the pandas speed in xarray? Pandas' resample time appears independent of the input period, while xarray's grows with it (i.e., with the number of output groups).
import numpy as np
import xarray as xr
import pandas as pd
import time
def make_ds(freq):
    # Build a Dataset of `size` random samples at the given time frequency.
    size = 100000
    times = pd.date_range('2000-01-01', periods=size, freq=freq)
    ds = xr.Dataset({
        'foo': xr.DataArray(
            data=np.random.random(size),
            dims=['time'],
            coords={'time': times},
        )})
    return ds
for f in ["1s", "1Min", "10Min"]:
    ds = make_ds(f)

    start = time.time()
    ds_r = ds.resample({'time': "1H"}).mean()
    print(f, 'xr', str(time.time() - start))

    start = time.time()
    df_r = ds.to_dataframe().resample("1H").mean()
    print(f, 'pd', str(time.time() - start))
1s xr 0.040313720703125
1s pd 0.0033435821533203125
1Min xr 0.5757267475128174
1Min pd 0.0025794506072998047
10Min xr 6.798743486404419
10Min pd 0.0029947757720947266
As per the xarray GH issue, this is an implementation issue: to be fast, the resampling (actually a GroupBy under the hood) has to be done outside xarray's own machinery. My solution is to use the fast pandas resample and then rebuild the xarray Dataset:
# What we want (quickly), but in pandas form
df_h = ds.to_dataframe().resample("1H").mean()

# Rebuild the Dataset, carrying over per-variable and global attrs
vals = [xr.DataArray(data=df_h[c], dims=['time'],
                     coords={'time': df_h.index},
                     attrs=ds[c].attrs)
        for c in df_h.columns]
ds_h = xr.Dataset(dict(zip(df_h.columns, vals)), attrs=ds.attrs)
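For reuse, the round trip can be wrapped in a small helper. This is a sketch, not part of the original answer: the name resample_mean is mine, and it assumes every variable is 1-D along a shared 'time' dimension. xr.testing.assert_allclose can confirm it matches the slow native path:

import xarray as xr

def resample_mean(ds, freq):
    # Resample via pandas for speed, then rebuild the Dataset,
    # preserving per-variable and global attrs.
    # Assumes all variables are 1-D along a 'time' dimension.
    df = ds.to_dataframe().resample(freq).mean()
    return xr.Dataset(
        {c: xr.DataArray(data=df[c], dims=['time'],
                         coords={'time': df.index},
                         attrs=ds[c].attrs)
         for c in df.columns},
        attrs=ds.attrs)

ds_h = resample_mean(ds, "1H")
xr.testing.assert_allclose(ds_h, ds.resample({'time': "1H"}).mean())

Note that xr.Dataset.from_dataframe(df_h) would also rebuild the Dataset in one call, but it drops the original attrs, which the explicit rebuild above preserves.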