
How can we resample a time series in Polars?


I'd like to use a bucket expression with groupby to downsample on a monthly basis, since the downsampling function will be deprecated. Is there an easy way to do this? datetime.timedelta only works with days and smaller units.


Solution

  • With the landing of group_by_dynamic we can now downsample and use the whole expression API for our aggregations. That means we can resample by:

    • upsampling
    • downsampling
    • first upsampling and then downsampling

    Let's go through an example:

    import polars as pl
    from datetime import datetime
    
    df = pl.DataFrame(
        {"time": pl.datetime_range(start=datetime(2021, 12, 16), end=datetime(2021, 12, 16, 3), interval="30m", eager=True),
         "groups": ["a", "a", "a", "b", "b", "a", "a"],
         "values": [1., 2., 3., 4., 5., 6., 7.]
        })
    
    print(df)
    
    shape: (7, 3)
    ┌─────────────────────┬────────┬────────┐
    │ time                ┆ groups ┆ values │
    │ ---                 ┆ ---    ┆ ---    │
    │ datetime[μs]        ┆ str    ┆ f64    │
    ╞═════════════════════╪════════╪════════╡
    │ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
    │ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
    │ 2021-12-16 01:00:00 ┆ a      ┆ 3.0    │
    │ 2021-12-16 01:30:00 ┆ b      ┆ 4.0    │
    │ 2021-12-16 02:00:00 ┆ b      ┆ 5.0    │
    │ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
    │ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
    └─────────────────────┴────────┴────────┘
    
    

    Upsampling

    Upsampling can be done by defining an interval. This will yield a DataFrame with nulls, which can then be filled with a fill strategy or interpolation.

    df.upsample("time", every="15m").fill_null(strategy="forward")
    
    shape: (13, 3)
    ┌─────────────────────┬────────┬────────┐
    │ time                ┆ groups ┆ values │
    │ ---                 ┆ ---    ┆ ---    │
    │ datetime[μs]        ┆ str    ┆ f64    │
    ╞═════════════════════╪════════╪════════╡
    │ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
    │ 2021-12-16 00:15:00 ┆ a      ┆ 1.0    │
    │ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
    │ 2021-12-16 00:45:00 ┆ a      ┆ 2.0    │
    │ 2021-12-16 01:00:00 ┆ a      ┆ 3.0    │
    │ …                   ┆ …      ┆ …      │
    │ 2021-12-16 02:00:00 ┆ b      ┆ 5.0    │
    │ 2021-12-16 02:15:00 ┆ b      ┆ 5.0    │
    │ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
    │ 2021-12-16 02:45:00 ┆ a      ┆ 6.0    │
    │ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
    └─────────────────────┴────────┴────────┘
    
    
    (df.upsample("time", every="15m")
       .interpolate()
       .fill_null(strategy="forward")  # string columns cannot be interpolated
    )
    
    shape: (13, 3)
    ┌─────────────────────┬────────┬────────┐
    │ time                ┆ groups ┆ values │
    │ ---                 ┆ ---    ┆ ---    │
    │ datetime[μs]        ┆ str    ┆ f64    │
    ╞═════════════════════╪════════╪════════╡
    │ 2021-12-16 00:00:00 ┆ a      ┆ 1.0    │
    │ 2021-12-16 00:15:00 ┆ a      ┆ 1.5    │
    │ 2021-12-16 00:30:00 ┆ a      ┆ 2.0    │
    │ 2021-12-16 00:45:00 ┆ a      ┆ 2.5    │
    │ 2021-12-16 01:00:00 ┆ a      ┆ 3.0    │
    │ …                   ┆ …      ┆ …      │
    │ 2021-12-16 02:00:00 ┆ b      ┆ 5.0    │
    │ 2021-12-16 02:15:00 ┆ b      ┆ 5.5    │
    │ 2021-12-16 02:30:00 ┆ a      ┆ 6.0    │
    │ 2021-12-16 02:45:00 ┆ a      ┆ 6.5    │
    │ 2021-12-16 03:00:00 ┆ a      ┆ 7.0    │
    └─────────────────────┴────────┴────────┘
    
    

    Downsampling

    Downsampling is the powerful one, because we can combine it with ordinary group-by keys: a virtual moving window over the time series (grouped by one or more keys) that can be aggregated with the expression API.

    (
        df.group_by_dynamic(
            index_column="time",
            every="1h",
            closed="both",
            group_by="groups",
            include_boundaries=True,
        )
        .agg(
            count=pl.col("time").count(),
            max=pl.col("time").max(),
            sum=pl.col("values").sum(),
        )
    )
    
    shape: (6, 7)
    ┌────────┬─────────────────────┬─────────────────────┬─────────────────────┬───────┬─────────────────────┬──────┐
    │ groups ┆ _lower_boundary     ┆ _upper_boundary     ┆ time                ┆ count ┆ max                 ┆ sum  │
    │ ---    ┆ ---                 ┆ ---                 ┆ ---                 ┆ ---   ┆ ---                 ┆ ---  │
    │ str    ┆ datetime[μs]        ┆ datetime[μs]        ┆ datetime[μs]        ┆ u32   ┆ datetime[μs]        ┆ f64  │
    ╞════════╪═════════════════════╪═════════════════════╪═════════════════════╪═══════╪═════════════════════╪══════╡
    │ a      ┆ 2021-12-16 00:00:00 ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 00:00:00 ┆ 3     ┆ 2021-12-16 01:00:00 ┆ 6.0  │
    │ a      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 1     ┆ 2021-12-16 01:00:00 ┆ 3.0  │
    │ a      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 2     ┆ 2021-12-16 03:00:00 ┆ 13.0 │
    │ a      ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 04:00:00 ┆ 2021-12-16 03:00:00 ┆ 1     ┆ 2021-12-16 03:00:00 ┆ 7.0  │
    │ b      ┆ 2021-12-16 01:00:00 ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 01:00:00 ┆ 2     ┆ 2021-12-16 02:00:00 ┆ 9.0  │
    │ b      ┆ 2021-12-16 02:00:00 ┆ 2021-12-16 03:00:00 ┆ 2021-12-16 02:00:00 ┆ 1     ┆ 2021-12-16 02:00:00 ┆ 5.0  │
    └────────┴─────────────────────┴─────────────────────┴─────────────────────┴───────┴─────────────────────┴──────┘