Search code examples
python-polars

Cumulative sum that resets when turning negative/positive


[enter image description here]

1

I am trying to add a column (column C) to my polars dataframe that counts how many times a value of one of the dataframe's columns (column A) is greater/less than the value of another column (column B). Once the value turns from less/greater to greater/less the cumulative sum should reset and start counting from 1/-1 again.


Solution

  • The data

    I'm going to change the data in the example you provided.

    df = pl.DataFrame(
        {
            "a": [11, 10, 10, 10, 9, 8, 8, 8, 8, 8, 15, 15, 15],
            "b": [11, 9, 9, 9, 9, 9, 10, 8, 8, 10, 11, 11, 15],
        }
    )
    print(df)
    
    shape: (13, 2)
    ┌─────┬─────┐
    │ a   ┆ b   │
    │ --- ┆ --- │
    │ i64 ┆ i64 │
    ╞═════╪═════╡
    │ 11  ┆ 11  │
    │ 10  ┆ 9   │
    │ 10  ┆ 9   │
    │ 10  ┆ 9   │
    │ 9   ┆ 9   │
    │ 8   ┆ 9   │
    │ 8   ┆ 10  │
    │ 8   ┆ 8   │
    │ 8   ┆ 8   │
    │ 8   ┆ 10  │
    │ 15  ┆ 11  │
    │ 15  ┆ 11  │
    │ 15  ┆ 15  │
    └─────┴─────┘
    

    Notice the cases where the two columns are the same. Your post didn't address what to do in these cases, so I made some assumptions as to what should happen. (You can adapt the code to handle those cases differently.)

    The algorithm

    df = (
        df
        .with_columns((pl.col("a") - pl.col("b")).sign().alias("sign_a_minus_b"))
        .with_columns(
            pl.when(pl.col("sign_a_minus_b") == 0)
            .then(None)
            .otherwise(pl.col("sign_a_minus_b"))
            .forward_fill()
            .alias("run_type")
        )
        .with_columns(
            (pl.col("run_type") != pl.col("run_type").shift(1, fill_value=0))
            .cum_sum()
            .alias("run_id")
        )
        .with_columns(pl.col("sign_a_minus_b").cum_sum().over("run_id").alias("result"))
    )
    print(df)
    
    shape: (13, 6)
    ┌─────┬─────┬────────────────┬──────────┬────────┬────────┐
    │ a   ┆ b   ┆ sign_a_minus_b ┆ run_type ┆ run_id ┆ result │
    │ --- ┆ --- ┆ ---            ┆ ---      ┆ ---    ┆ ---    │
    │ i64 ┆ i64 ┆ i64            ┆ i64      ┆ u32    ┆ i64    │
    ╞═════╪═════╪════════════════╪══════════╪════════╪════════╡
    │ 11  ┆ 11  ┆ 0              ┆ null     ┆ null   ┆ 0      │
    │ 10  ┆ 9   ┆ 1              ┆ 1        ┆ null   ┆ 1      │
    │ 10  ┆ 9   ┆ 1              ┆ 1        ┆ 0      ┆ 1      │
    │ 10  ┆ 9   ┆ 1              ┆ 1        ┆ 0      ┆ 2      │
    │ 9   ┆ 9   ┆ 0              ┆ 1        ┆ 0      ┆ 2      │
    │ 8   ┆ 9   ┆ -1             ┆ -1       ┆ 1      ┆ -1     │
    │ 8   ┆ 10  ┆ -1             ┆ -1       ┆ 1      ┆ -2     │
    │ 8   ┆ 8   ┆ 0              ┆ -1       ┆ 1      ┆ -2     │
    │ 8   ┆ 8   ┆ 0              ┆ -1       ┆ 1      ┆ -2     │
    │ 8   ┆ 10  ┆ -1             ┆ -1       ┆ 1      ┆ -3     │
    │ 15  ┆ 11  ┆ 1              ┆ 1        ┆ 2      ┆ 1      │
    │ 15  ┆ 11  ┆ 1              ┆ 1        ┆ 2      ┆ 2      │
    │ 15  ┆ 15  ┆ 0              ┆ 1        ┆ 2      ┆ 2      │
    └─────┴─────┴────────────────┴──────────┴────────┴────────┘
    

    I've left the intermediate calculations in the output, merely to show how the algorithm works. (You can drop them.)

    The basic idea is to calculate a run_id for each run of positive or negative values. We will then use the cumsum function and the over windowing expression to create a running count of positives/negatives over each run_id.

    Key assumption: ties in columns a and b do not interrupt a run, but they do not contribute to the total for that run of positive/negative values.

    sign_a_minus_b does two things: it identifies whether a run is positive/negative, and whether there is a tie in columns a and b.

    run_type extends any run to include any cases where a tie occurs in columns a and b. The null value at the top of the column was intended - it shows what happens when a tie occurs in the first row.

    result is the output column. Note that tied columns do not interrupt a run, but they don't contribute to the totals for that run.

    One final note: if ties in columns a and b are not allowed, then this algorithm can be simplified ... and run faster.