Tags: python, python-polars

Polars - How to create dynamic rolling window for calculations


Consider the following dataframe of sensor readings laid out along a straight line:

import polars as pl

df = pl.from_repr("""
┌─────────────────────┬───────────────────┬──────────┐
│ Location Start (KM) ┆ Location End (KM) ┆ Readings │
│ ---                 ┆ ---               ┆ ---      │
│ f64                 ┆ f64               ┆ f64      │
╞═════════════════════╪═══════════════════╪══════════╡
│ 1.0                 ┆ 1.1               ┆ 7.0      │
│ 1.1                 ┆ 1.23              ┆ null     │
│ 1.23                ┆ 1.3               ┆ 8.0      │
│ 1.3                 ┆ 1.34              ┆ null     │
│ 1.34                ┆ 1.4               ┆ null     │
│ 1.4                 ┆ 1.5               ┆ 5.0      │
│ 1.5                 ┆ 1.65              ┆ 6.0      │
└─────────────────────┴───────────────────┴──────────┘
""")

I am trying to create a rolling 150 m look-ahead window to calculate the percentage of non-null readings within that window, expecting the results to look a little like the below:

┌─────────────────────┬───────────────────┬──────────┬─────────────────────────────────┐
│ Location Start (KM) ┆ Location End (KM) ┆ Readings ┆ Rolling % Non Null Readings (%) │
│ ---                 ┆ ---               ┆ ---      ┆ ---                             │
│ f64                 ┆ f64               ┆ f64      ┆ i64                             │
╞═════════════════════╪═══════════════════╪══════════╪═════════════════════════════════╡
│ 1.0                 ┆ 1.1               ┆ 7.0      ┆ 67                              │
│ 1.1                 ┆ 1.23              ┆ null     ┆ 13                              │
│ 1.23                ┆ 1.3               ┆ 8.0      ┆ 47                              │
│ 1.3                 ┆ 1.34              ┆ null     ┆ 33                              │
│ 1.34                ┆ 1.4               ┆ null     ┆ 60                              │
│ 1.4                 ┆ 1.5               ┆ 5.0      ┆ 100                             │
│ 1.5                 ┆ 1.65              ┆ 6.0      ┆ 100                             │
└─────────────────────┴───────────────────┴──────────┴─────────────────────────────────┘

Alternatively, a centered window would also work (but the example above is for a look-ahead window).

It seems that this style of dynamic rolling window is somewhat supported via Polars' group_by_dynamic, but that appears to work only for temporal values, whereas the values in my columns are floats, since they represent spatial locations. The rolling_map method also seems to provide a means to an end; however, it creates a rolling window over a fixed number of rows, which doesn't quite suit this use case, as the number of rows to include in the window will differ depending on certain conditions (in this case, the length of a particular reading).

How should I go about performing these rolling calculations? I am trying to avoid writing explicit loops that go through each row and check multiple conditions, but I can't seem to figure out how to do this with the built-in methods.

I have attached a simple working example below, written in Pandas, that does the job in a naive, loop-heavy way. I was hoping to implement a faster version in Polars using some variant of its group_by_dynamic or rolling functionality:

import numpy as np
import pandas as pd

# sample dataframe
df = pd.DataFrame({'Location Start (KM)': [1, 1.1, 1.23, 1.3, 1.34, 1.4, 1.5],
                   'Location End (KM)': [1.1, 1.23, 1.3, 1.34, 1.4, 1.5, 1.65],
                   'Readings': [7, np.nan, 8, np.nan, np.nan, 5, 6]})

# manual loop to simulate a rolling forward-looking window
def calculate_perc_non_null(row, window_length):
    '''
    Naive function that calculates the percentage of non-null readings
    within a forward rolling window of a dataframe. Takes in:
    row: pandas Series representing each row of the dataframe
    window_length: float describing the window length in KM
    '''
    window_start = row['Location Start (KM)']
    window_end = window_start + window_length  # generate window endpoints
    # readings whose start falls within the specified window
    eligible_readings = df.loc[(df['Location Start (KM)'] >= window_start)
                               & (df['Location Start (KM)'] <= window_end)]
    # grab all null values and measure the length they cover
    nulls = eligible_readings.loc[eligible_readings['Readings'].isnull()].copy()
    # truncate ends to ensure no length outside the window is counted
    nulls.loc[nulls['Location End (KM)'] > window_end, 'Location End (KM)'] = window_end
    total_length_of_nulls = (nulls['Location End (KM)'] - nulls['Location Start (KM)']).sum()
    non_null_perc = 100 * (1 - total_length_of_nulls / window_length)
    return non_null_perc

df['Rolling % Non Null Readings (%)'] = df.apply(
    lambda x: calculate_perc_non_null(x, window_length=0.15), axis=1)

Which outputs:


 Location Start (KM)  Location End (KM)  Readings  \
0                 1.00               1.10       7.0   
1                 1.10               1.23       NaN   
2                 1.23               1.30       8.0   
3                 1.30               1.34       NaN   
4                 1.34               1.40       NaN   
5                 1.40               1.50       5.0   
6                 1.50               1.65       6.0   

   Rolling % Non Null Readings (%)  
0                        66.666667  
1                        13.333333  
2                        46.666667  
3                        33.333333  
4                        60.000000  
5                       100.000000  
6                       100.000000  

Edit on proposed solution (30/11/2022):

ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's solution from 30/11/22 works perfectly fine. I've attached a few performance benchmarks below (not apples to apples, since different dataframe libraries are used here, but it should illustrate how quickly this runs).

Original pandas code:

11.3 ms ± 1.95 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's solution:

527 µs ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Or about a 25x speedup. The performance gap widens as the underlying dataframe gets larger (I used a 1000x larger dummy frame):

Original pandas code:

10 s ± 580 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ's solution:

12.2 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

Or 820x faster, so it scales well.


Solution

  • I may not understand what you need, but let's see if we can get the ball rolling.

    Let's start by using rolling and see if this gets us close to what is needed.

    First, let's convert the floats (kilometers) to integers (meters). That allows us to specify our period as an integer: 150i. We'll then calculate a duration (loc_end - loc_start) that we'll use in our calculations.
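    Note that the rolling snippet below refers to columns named loc_start, loc_end and readings rather than the question's original headers. The rename step isn't shown in the answer, so here is a minimal sketch of that preparation (the short names are an assumption inferred from the snippet):

    ```python
    import polars as pl

    # the question's dataframe, rebuilt literally
    df = pl.DataFrame({
        "Location Start (KM)": [1.0, 1.1, 1.23, 1.3, 1.34, 1.4, 1.5],
        "Location End (KM)": [1.1, 1.23, 1.3, 1.34, 1.4, 1.5, 1.65],
        "Readings": [7.0, None, 8.0, None, None, 5.0, 6.0],
    })

    # rename to the short names used by the rolling code (assumed mapping)
    df = df.rename({
        "Location Start (KM)": "loc_start",
        "Location End (KM)": "loc_end",
        "Readings": "readings",
    })
    ```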

    Also, we'll create a boolean (is_not_null) which will automatically be upcast to 0 or 1 in our calculations.

    The basic idea is to allow rolling to calculate each window. We'll use a dot product to calculate our weighted values.
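    To see the dot-product trick in isolation (toy numbers, not the answer's code): the boolean column behaves as 0/1 in arithmetic, so dotting the row lengths with it yields the total length of the non-null rows. An explicit cast is used here to be safe across Polars versions:

    ```python
    import polars as pl

    # toy window: a 100 m non-null row followed by a 130 m null row
    window = pl.DataFrame({
        "duration": [100, 130],
        "is_not_null": [True, False],
    })

    # dot product of lengths with the 0/1 mask: 100*1 + 130*0
    non_null_length = window.select(
        pl.col("duration").dot(pl.col("is_not_null").cast(pl.Int64))
    ).item()
    ```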

    Since each window will likely cover more than 150 meters, we'll need to back out the overage from the last row (the amount by which the window's total length exceeds 150) before dividing by 150.
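    As a concrete check of that correction, take the window starting at loc_start = 1400 in the output further down: it spans rows of length 100 m and 150 m (250 m in total, 100 m over the period), and the last row is non-null, so the excess is subtracted:

    ```python
    # window starting at loc_start = 1400: both readings are non-null
    window_size = 150
    durations = [100, 150]
    is_not_null = [1, 1]

    weighted = sum(d * m for d, m in zip(durations, is_not_null))  # 250
    overage = (sum(durations) - window_size) * is_not_null[-1]     # (250 - 150) * 1 = 100
    result = (weighted - overage) / window_size                    # (250 - 100) / 150 = 1.0
    ```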

    Here's what this would look like. (Note: I've also asked Polars to include the values it sees in each window -- as lists -- so that the results can be more easily inspected).

    window_size = 150
    (
        df
        .with_columns(
            (pl.col('^loc.*$') * 1_000).cast(pl.Int64),
            pl.col('readings').is_not_null().alias('is_not_null'),
        )
        .with_columns(
            (pl.col('loc_end') - pl.col('loc_start')).alias('duration'),
        )
        .rolling(
            index_column='loc_start',
            period=str(window_size) + "i",
            offset="0i",
            closed="left",
        )
        .agg(
            (
                (
                    pl.col('duration').dot('is_not_null') -
                    (pl.sum('duration') - window_size) * pl.col('is_not_null').last()
                ) / window_size
            ).alias('result'),
            pl.all().name.suffix('_val_list'),
        )
    )
    
    shape: (7, 6)
    ┌───────────┬──────────┬────────────────────┬───────────────────┬──────────────────────┬───────────────────┐
    │ loc_start ┆ result   ┆ loc_end_val_list   ┆ readings_val_list ┆ is_not_null_val_list ┆ duration_val_list │
    │ ---       ┆ ---      ┆ ---                ┆ ---               ┆ ---                  ┆ ---               │
    │ i64       ┆ f64      ┆ list[i64]          ┆ list[f64]         ┆ list[bool]           ┆ list[i64]         │
    ╞═══════════╪══════════╪════════════════════╪═══════════════════╪══════════════════════╪═══════════════════╡
    │ 1000      ┆ 0.666667 ┆ [1100, 1230]       ┆ [7.0, null]       ┆ [true, false]        ┆ [100, 130]        │
    │ 1100      ┆ 0.133333 ┆ [1230, 1300]       ┆ [null, 8.0]       ┆ [false, true]        ┆ [130, 70]         │
    │ 1230      ┆ 0.466667 ┆ [1300, 1340, 1400] ┆ [8.0, null, null] ┆ [true, false, false] ┆ [70, 40, 60]      │
    │ 1300      ┆ 0.333333 ┆ [1340, 1400, 1500] ┆ [null, null, 5.0] ┆ [false, false, true] ┆ [40, 60, 100]     │
    │ 1340      ┆ 0.6      ┆ [1400, 1500]       ┆ [null, 5.0]       ┆ [false, true]        ┆ [60, 100]         │
    │ 1400      ┆ 1.0      ┆ [1500, 1650]       ┆ [5.0, 6.0]        ┆ [true, true]         ┆ [100, 150]        │
    │ 1500      ┆ 1.0      ┆ [1650]             ┆ [6.0]             ┆ [true]               ┆ [150]             │
    └───────────┴──────────┴────────────────────┴───────────────────┴──────────────────────┴───────────────────┘
    

    Note: the third value in the result column (0.46667) differs from the result in your posting (0.73).
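    If you also want the question's integer-percentage column, a small follow-up step (my addition, not part of the original answer) would scale and round the result. Here the result values are hard-coded as a stand-in for the rolling output:

    ```python
    import polars as pl

    # stand-in for the `result` column produced by the rolling aggregation
    out = pl.DataFrame({"result": [0.666667, 0.133333, 0.466667, 0.333333, 0.6, 1.0, 1.0]})

    # scale to percent, round to the nearest integer, and cast
    out = out.with_columns(
        (pl.col("result") * 100)
        .round(0)
        .cast(pl.Int64)
        .alias("Rolling % Non Null Readings (%)")
    )
    ```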

    Does this get the ball rolling?