
How can we resample a time series in Polars?


I'd like to use a bucket expression with groupby to downsample on a monthly basis, since the downsampling function will be deprecated. Is there an easy way to do this? datetime.timedelta only works with days and smaller units.


Solution

  • With the landing of group_by_dynamic we can now downsample and use the whole expression API for our aggregations. That means we can resample by:

    • upsampling
    • downsampling
    • first upsampling and then downsampling

    Let's go through an example:

    import polars as pl
    from datetime import datetime
    
    df = pl.DataFrame(
        {"time": pl.datetime_range(start=datetime(2021, 12, 16), end=datetime(2021, 12, 16, 3), interval="30m", eager=True),
         "groups": ["a", "a", "a", "b", "b", "a", "a"],
         "values": [1., 2., 3., 4., 5., 6., 7.]
        })
    
    print(df)
    
    shape: (7, 3)
    ┌─────────────────────┬────────┬────────┐
    │ time                ┆ groups ┆ values │
    │ ---                 ┆ ---    ┆ ---    │
    │ datetime[μs]        ┆ str    ┆ f64    │
    ╞═════════════════════╪════════╪════════╡
    │ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
    │ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
    │ 2021-12-16 01:00:00 ┆ a      ┆ 3.0    │
    │ 2021-12-16 01:30:00 ┆ b      ┆ 4.0    │
    │ 2021-12-16 02:00:00 ┆ b      ┆ 5.0    │
    │ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
    │ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
    └─────────────────────┴────────┴────────┘
    
    

    Upsampling

    Upsampling can be done by defining an interval. This will yield a DataFrame with nulls, which can then be filled with a fill strategy or interpolation.

    df.upsample("time", every="15m").fill_null(strategy="forward")
    
    shape: (13, 3)
    ┌─────────────────────┬────────┬────────┐
    │ time                ┆ groups ┆ values │
    │ ---                 ┆ ---    ┆ ---    │
    │ datetime[μs]        ┆ str    ┆ f64    │
    ╞═════════════════════╪════════╪════════╡
    │ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
    │ 2021-12-16 00:15:00 ┆ a      ┆ 1.0    │
    │ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
    │ 2021-12-16 00:45:00 ┆ a      ┆ 2.0    │
    │ 2021-12-16 01:00:00 ┆ a      ┆ 3.0    │
    │ …                   ┆ …      ┆ …      │
    │ 2021-12-16 02:00:00 ┆ b      ┆ 5.0    │
    │ 2021-12-16 02:15:00 ┆ b      ┆ 5.0    │
    │ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
    │ 2021-12-16 02:45:00 ┆ a      ┆ 6.0    │
    │ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
    └─────────────────────┴────────┴────────┘
    
    
    (df.upsample("time", every="15m")
       .interpolate()
       .fill_null(strategy="forward")  # string columns cannot be interpolated
    )
    
    shape: (13, 3)
    ┌─────────────────────┬────────┬────────┐
    │ time                ┆ groups ┆ values │
    │ ---                 ┆ ---    ┆ ---    │
    │ datetime[μs]        ┆ str    ┆ f64    │
    ╞═════════════════════╪════════╪════════╡
    │ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
    │ 2021-12-16 00:15:00 ┆ a      ┆ 1.5    │
    │ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
    │ 2021-12-16 00:45:00 ┆ a      ┆ 2.5    │
    │ 2021-12-16 01:00:00 ┆ a      ┆ 3.0    │
    │ …                   ┆ …      ┆ …      │
    │ 2021-12-16 02:00:00 ┆ b      ┆ 5.0    │
    │ 2021-12-16 02:15:00 ┆ b      ┆ 5.5    │
    │ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
    │ 2021-12-16 02:45:00 ┆ a      ┆ 6.5    │
    │ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
    └─────────────────────┴────────┴────────┘
    
    

    Downsampling

    Downsampling is the powerful one, because we can combine it with ordinary group-by keys: a virtual moving window over the time series (grouped by one or more keys) that can be aggregated with the expression API.

    (
        df.group_by_dynamic(
            index_column="time",
            every="1h",
            closed="both",
            group_by="groups",
            include_boundaries=True,
        )
        .agg(
            count=pl.col("time").count(),
            max=pl.col("time").max(),
            sum=pl.col("values").sum(),
        )
    )
    
    shape: (6, 7)
    ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┬─────────────────────┬──────┐
    │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ count ┆ max                 ┆ sum  │
    │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---   ┆ ---                 ┆ ---  │
    │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   ┆ datetime[μs]        ┆ f64  │
    ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════╪═════════════════════╪══════╡
    │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3     ┆ 2021-12-16 01:00:00 ┆ 6.0  │
    │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1     ┆ 2021-12-16 01:00:00 ┆ 3.0  │
    │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2     ┆ 2021-12-16 03:00:00 ┆ 13.0 │
    │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1     ┆ 2021-12-16 03:00:00 ┆ 7.0  │
    │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2     ┆ 2021-12-16 02:00:00 ┆ 9.0  │
    │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1     ┆ 2021-12-16 02:00:00 ┆ 5.0  │
    └────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┴─────────────────────┴──────┘