Search code examples
pythondataframepython-polars

Upsampling a polars dataframe with groupby forward fill only some columns?


Is it possible to fill_forward only some columns when upsampling using Polars?

For example, would like to fill in the missing dates in sample dataframe (see code below). 'upsample' and 'forward_fill' works beautifully and is blazing fast with a much larger dataset. The output is as expected in the table below. No issues, all as expected.

However: Is it possible to exclude a column for the forward_fill, so for example it returns blanks in the utc_time column instead of filling the time? I have tried listing the columns in select statement by replacing pl.all() with pl.col(...), but that just removes the column that is not listed.

import polars as pl
from datetime import datetime

df = pl.DataFrame(  
    {  
     'utc_created':pl.date_range(datetime(2021, 12, 16), datetime(2021, 12, 22, 0), interval="2d", eager=True),  
     'utc_time':['21:12:06','21:20:06','17:51:10','03:54:49'],  
     'sku':[9100000801,9100000801,9100000801,9100000801],  
     'old':[18,17,16,15],  
     'new':[17,16,15,14],  
     'alert_type':['Inventory','Inventory','Inventory','Inventory'],  
     'alert_level':['Info','Info','Info','Info']  
    }  
    )

df = (df.upsample(time_column='utc_created',every='1d', group_by='sku')
        .select(pl.all().forward_fill())
)

Returns:

shape: (7, 7)
┌─────────────┬──────────┬────────────┬─────┬─────┬────────────┬─────────────┐
│ utc_created ┆ utc_time ┆ sku        ┆ old ┆ new ┆ alert_type ┆ alert_level │
│ ---         ┆ ---      ┆ ---        ┆ --- ┆ --- ┆ ---        ┆ ---         │
│ date        ┆ str      ┆ i64        ┆ i64 ┆ i64 ┆ str        ┆ str         │
╞═════════════╪══════════╪════════════╪═════╪═════╪════════════╪═════════════╡
│ 2021-12-16  ┆ 21:12:06 ┆ 9100000801 ┆ 18  ┆ 17  ┆ Inventory  ┆ Info        │
│ 2021-12-17  ┆ 21:12:06 ┆ 9100000801 ┆ 18  ┆ 17  ┆ Inventory  ┆ Info        │
│ 2021-12-18  ┆ 21:20:06 ┆ 9100000801 ┆ 17  ┆ 16  ┆ Inventory  ┆ Info        │
│ 2021-12-19  ┆ 21:20:06 ┆ 9100000801 ┆ 17  ┆ 16  ┆ Inventory  ┆ Info        │
│ 2021-12-20  ┆ 17:51:10 ┆ 9100000801 ┆ 16  ┆ 15  ┆ Inventory  ┆ Info        │
│ 2021-12-21  ┆ 17:51:10 ┆ 9100000801 ┆ 16  ┆ 15  ┆ Inventory  ┆ Info        │
│ 2021-12-22  ┆ 03:54:49 ┆ 9100000801 ┆ 15  ┆ 14  ┆ Inventory  ┆ Info        │
└─────────────┴──────────┴────────────┴─────┴─────┴────────────┴─────────────┘

Solution

  • You can use pl.exclude() and .forward_fill() e.g.

    df = (
       df.upsample(time_column="utc_created", every="1d", group_by="sku")
         .with_columns(pl.exclude("utc_time").forward_fill())
    )
    
    shape: (7, 7)
    ┌─────────────┬──────────┬────────────┬─────┬─────┬────────────┬─────────────┐
    │ utc_created ┆ utc_time ┆ sku        ┆ old ┆ new ┆ alert_type ┆ alert_level │
    │ ---         ┆ ---      ┆ ---        ┆ --- ┆ --- ┆ ---        ┆ ---         │
    │ date        ┆ str      ┆ i64        ┆ i64 ┆ i64 ┆ str        ┆ str         │
    ╞═════════════╪══════════╪════════════╪═════╪═════╪════════════╪═════════════╡
    │ 2021-12-16  ┆ 21:12:06 ┆ 9100000801 ┆ 18  ┆ 17  ┆ Inventory  ┆ Info        │
    │ 2021-12-17  ┆ null     ┆ 9100000801 ┆ 18  ┆ 17  ┆ Inventory  ┆ Info        │
    │ 2021-12-18  ┆ 21:20:06 ┆ 9100000801 ┆ 17  ┆ 16  ┆ Inventory  ┆ Info        │
    │ 2021-12-19  ┆ null     ┆ 9100000801 ┆ 17  ┆ 16  ┆ Inventory  ┆ Info        │
    │ 2021-12-20  ┆ 17:51:10 ┆ 9100000801 ┆ 16  ┆ 15  ┆ Inventory  ┆ Info        │
    │ 2021-12-21  ┆ null     ┆ 9100000801 ┆ 16  ┆ 15  ┆ Inventory  ┆ Info        │
    │ 2021-12-22  ┆ 03:54:49 ┆ 9100000801 ┆ 15  ┆ 14  ┆ Inventory  ┆ Info        │
    └─────────────┴──────────┴────────────┴─────┴─────┴────────────┴─────────────┘