Search code examples
pythonpython-polars

Want to downsample timeseries dataframe with Polars group_by_dynamic including empty bins


I'm trying to downsample a dataframe in polars which has a column with timestamps. I want to downsample the column to intervals of 1s. But when i do this using group_by_dynamic only the seconds which have values are coming in the result. But I want the result to contain even seconds which has no entries so I can forward fill.

I tried group_by_dynamic, and it doesnt seem to do what I want.

taq_df.group_by_dynamic("timestamp", every = '1s').agg(pl.exclude("bestex_timestamp").last())

Solution

  • TLDR. Applying pl.DataFrame.upsample after the aggregation will fill the missing timestamps.

    Consider the following example frame.

    import polars as pl
    from datetime import datetime
    
    df = pl.DataFrame({
        "timestamp": [
            datetime(2024, 1, 1, 0, 0, 0, 0), 
            datetime(2024, 1, 1, 0, 0, 2, 5),
            datetime(2024, 1, 1, 0, 0, 3, 0), 
            datetime(2024, 1, 1, 0, 0, 5, 1),
        ],
        "value": [1, 2, 3, 4],
    }).set_sorted("timestamp")
    
    shape: (4, 2)
    ┌────────────────────────────┬───────┐
    │ timestamp                  ┆ value │
    │ ---                        ┆ ---   │
    │ datetime[μs]               ┆ i64   │
    ╞════════════════════════════╪═══════╡
    │ 2024-01-01 00:00:00        ┆ 1     │
    │ 2024-01-01 00:00:02.000005 ┆ 2     │
    │ 2024-01-01 00:00:03        ┆ 3     │
    │ 2024-01-01 00:00:05.000001 ┆ 4     │
    └────────────────────────────┴───────┘
    

    Simply applying pl.DataFrame.group_by_dynamic as suggested in the question, results in the following dataframe (with missing timestamps for seconds 1 and 4).

    (
        df
        .group_by_dynamic(index_column="timestamp", every="1s")
        .agg(pl.col("value").last())
    )
    
    shape: (4, 2)
    ┌─────────────────────┬───────┐
    │ timestamp           ┆ value │
    │ ---                 ┆ ---   │
    │ datetime[μs]        ┆ i64   │
    ╞═════════════════════╪═══════╡
    │ 2024-01-01 00:00:00 ┆ 1     │
    │ 2024-01-01 00:00:02 ┆ 2     │
    │ 2024-01-01 00:00:03 ┆ 3     │
    │ 2024-01-01 00:00:05 ┆ 4     │
    └─────────────────────┴───────┘
    

    Now, applying pl.DataFrame.upsample fills the missing timestamps. All other column will have null values for the new rows.

    (
        df
        .group_by_dynamic(index_column="timestamp", every="1s")
        .agg(pl.col("value").last())
        .upsample("timestamp", every="1s")
    )
    
    shape: (6, 2)
    ┌─────────────────────┬───────┐
    │ timestamp           ┆ value │
    │ ---                 ┆ ---   │
    │ datetime[μs]        ┆ i64   │
    ╞═════════════════════╪═══════╡
    │ 2024-01-01 00:00:00 ┆ 1     │
    │ 2024-01-01 00:00:01 ┆ null  │
    │ 2024-01-01 00:00:02 ┆ 2     │
    │ 2024-01-01 00:00:03 ┆ 3     │
    │ 2024-01-01 00:00:04 ┆ null  │
    │ 2024-01-01 00:00:05 ┆ 4     │
    └─────────────────────┴───────┘