
NumPy (or SciPy) binning of time series values based on timestamps

I'm trying to bin (downsample) a time series based on its timestamps. For instance:

import numpy as np
import pandas as pd

timestamps = np.linspace(0, 1000, 10000)
values = np.random.random(10000)

I usually convert it to a dataframe, and use cut (or qcut) to create the bins:

timeseries_df = pd.DataFrame({"Timestamps": timestamps, "Values": values})
timeseries_df["Bins"] = pd.cut(timeseries_df["Timestamps"], 100)  # downsample by two orders of magnitude
grouped = timeseries_df.groupby("Bins")  # group once, reuse for both aggregations
ds_timestamps = grouped["Timestamps"].max()  # last timestamp in each bin
ds_values = grouped["Values"].mean()  # mean value in each bin

This works, but I'm writing reusable functions and would like to avoid depending on pandas if possible. I've tried implementing a version of what's been suggested here:

ds_timestamps = np.linspace(timestamps.min(), timestamps.max(), 100)
digitized_timestamps = np.digitize(timestamps, ds_timestamps)
ds_values = [values[digitized_timestamps == i+1].mean() for i in range(len(ds_timestamps))]

This also works but is extremely slow. Is there another way of doing this?


  • As mentioned in the comments, if speed is your main reason for avoiding pandas, I'd actually recommend using it: pandas is not written entirely in Python, and many of its internals are implemented in Cython (compiled to C), so they are very fast.
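
If dropping the pandas dependency is still the goal, the per-bin list comprehension in the question can be replaced with a vectorized pass. The following is a minimal sketch, not a drop-in equivalent of the pandas version: the helper name bin_mean and its n_bins parameter are made up for illustration, and it assumes equal-width bins spanning the full timestamp range. It uses np.digitize to assign each sample a bin index and np.bincount to accumulate per-bin sums and counts.

```python
import numpy as np

def bin_mean(timestamps, values, n_bins):
    """Average `values` over `n_bins` equal-width bins of `timestamps`.

    Hypothetical helper for illustration; returns (edges, means).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    values = np.asarray(values, dtype=float)

    # n_bins + 1 edges define n_bins equal-width intervals.
    edges = np.linspace(timestamps.min(), timestamps.max(), n_bins + 1)

    # Digitizing against the interior edges only maps every sample to an
    # index in 0..n_bins-1; the maximum timestamp falls in the last bin.
    idx = np.digitize(timestamps, edges[1:-1])

    # One pass each for the per-bin counts and the per-bin sums.
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=values, minlength=n_bins)

    # Empty bins produce 0/0, which is reported as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts
    return edges, means


timestamps = np.linspace(0, 1000, 10000)
values = np.random.random(10000)
edges, ds_values = bin_mean(timestamps, values, 100)
ds_timestamps = edges[1:]  # right edge of each bin
```

Bins here are closed on the left, whereas pd.cut defaults to right-closed intervals, so samples lying exactly on an edge may be assigned to a different bin than in the pandas version. If SciPy is available, scipy.stats.binned_statistic with statistic="mean" covers the same use case.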