Tags: python, pandas, group-by, time-series, timestamp

Rolling mean, count or quantile per group


I have a large set of tick data with millisecond granularity, which I have loaded into a pandas DataFrame. To make handling easier, I have added columns assigning each timestamp to a year, month, week, day, hour, and minute.

I would like to have a mean, percentile, and count based on a 4-week rolling window. I can't just roll through the timestamps/rows, as each week has a different number of timestamps (ranging from 50k to 70k depending on weekly activity). This is why I would like to group the timestamps into their respective weeks and calculate the mean, percentile, and count of all timestamps within the rolling set of weeks.

If I were to first get the mean, percentile, and count per individual week, create a new df, and roll through that, I'm afraid information would be lost - especially regarding the percentile.
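For example, the median of the per-week medians is generally not the median of the pooled data:

import numpy as np

week1, week2 = [1, 2, 100], [3, 4, 5]
print(np.median([np.median(week1), np.median(week2)]))  # 3.0
print(np.median(week1 + week2))                         # 3.5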

I have tried both of the code snippets below (adding to the existing df and creating a new series):

df.groupby("week")["price"].transform(lambda x: x.rolling(4, 1).mean())
_ma = df.groupby("week")["price"].rolling(4, 1).mean()

Neither seems to work: both return an average of the last 4 timestamps/rows - not of all timestamps grouped within the last 4 weeks.
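A minimal reproduction (made-up numbers) shows what is going on: after the groupby, rolling(4, 1) slides over rows within each week, so the window covers at most 4 rows of the same week rather than 4 whole weeks:

import pandas as pd

toy = pd.DataFrame({"week": [1] * 5 + [2] * 5, "price": range(10)})
# The window restarts in every group and counts rows, not weeks
print(toy.groupby("week")["price"].transform(lambda x: x.rolling(4, 1).mean()))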

For now I have found a workaround using a loop to create a list of dictionaries, which are then mapped to the df.

While the code below does work, I was hoping for a more elegant solution.

workaround solution:

weeks = df["week"].unique()
window = 4

results = []

# Slide a window of `window` consecutive weeks over the data and key the
# stats of each window by the week that follows it.
for i in range(len(weeks) - window):
    weeks_group = weeks[i: i + window]
    key = weeks[i + window]
    weeks_rolling = df[df["week"].isin(weeks_group)]
    average_price = weeks_rolling["price"].mean()
    quantile = weeks_rolling["price"].quantile(0.5)
    count = weeks_rolling["price"].count()
    results.append({key: [average_price, quantile, count]})
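The mapping step itself is then just a dict lookup; for example (the merged lookup and the avg_price_4w column name are mine, not from the original code):

lookup = {k: v for d in results for k, v in d.items()}
df["avg_price_4w"] = df["week"].map(lambda w: lookup[w][0] if w in lookup else None)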

Solution

  • Your requirements can't be satisfied with pandas's rolling function. Not only does each window have a variable start (it can begin 50k or 70k rows above the current one), it can also extend below the current row. You can try defining your own window function with BaseIndexer (a sketch follows at the end of this answer), but that's nowhere near "elegant".

    My recommendation is to keep your current solution: it's simple, it's easy to understand, it's not too slow.

    If you want speed, you can implement a numpy solution:

    import numpy as np

    # This solution assumes that your dataframe is sorted and sequentially-indexed.
    # If that is not the case already, you can do the following:
    #   df = df.sort_values("timestamp").reset_index(drop=True)
    
    window = 4
    
    weeks = df["week"].unique()
    # Find the position just past the last row of each week
    end_index = df.drop_duplicates("week", keep="last").index + 1
    # Split the price column into one array per week
    prices = np.split(df["price"].to_numpy(), end_index)
    
    results = {
        week: [np.mean(p), np.quantile(p, 0.5), len(p)]
        for i, week in enumerate(weeks)
        if (i >= window - 1)
        # np.hstack combines prices of the last `window` weeks together.
        # The walrus operator (:=) is a trick used to assign the slice to a
    # variable. Requires Python 3.8+.
        and (p := np.hstack(prices[i - window + 1 : i + 1])) is not None
    }
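
    For reference, here is a minimal sketch of the BaseIndexer route mentioned above (the class name, week_codes, and the output column name are my own inventions; it assumes the frame is sorted by timestamp, that the week labels increase chronologically, and that there are no gaps in the week numbering):

    from pandas.api.indexers import BaseIndexer
    import numpy as np

    class WeekWindowIndexer(BaseIndexer):
        # For each row, the window runs from the first row of the week
        # `window_size - 1` weeks back through the last row of the row's
        # own week - i.e. it can extend below the current row.
        def get_window_bounds(self, num_values, min_periods, center, closed, step=None):
            codes = self.week_codes  # sorted integer week codes, one per row
            start = np.searchsorted(codes, codes - (self.window_size - 1), side="left")
            end = np.searchsorted(codes, codes, side="right")
            return start.astype(np.int64), end.astype(np.int64)

    week_codes = df["week"].astype("category").cat.codes.to_numpy()
    indexer = WeekWindowIndexer(window_size=4, week_codes=week_codes)
    df["price_mean_4w"] = df["price"].rolling(indexer, min_periods=1).mean()

    The same indexer works with .quantile(0.5) and .count(), but as said, this is hardly more elegant than your loop.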