Search code examples
pythonnullpython-polarsrolling-computation

Polars - How to extract last non-null value on a given column


I'd like to perform the following:

Input:

df = pl.DataFrame({
    "a": [1,15,None,20,None]
})

Output:

df = pl.DataFrame({
    "a": [1,15,None,20,None],
    "b": [0,14,None,5,None]
})

That is, from:

A
1
15
None
20
None

to:

A B
1 0
15 14
None None
20 5
None None

So, what it does:

  1. If the value of "A" is null, then value of B (output column) is also Null
  2. If "A" has some value, please retrieve the last Non-Null value in "A", and then subtract the current value in "A" with the previous Non-Null value

I'd like to perform this in python's polars dataframe library, but I can't seem to find a solution.

I've tried the following question:

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

But unfortunately, this does not answer the original problem, since the question performs an aggregation of an entire column, and then takes the last value of that column.

What I'd like to do is not to aggregate an entire column, but simply to subtract a current value with a previous non-null value.

I have also tried to use rolling:

df = df.with_row_index().rolling(
    index_column = 'index',
    period = '???i').agg(pl.col("A").last())

But, of course, that does not work because the occurence of Null Values cannot be determined (i.e. it is not periodic, so I don't know how many indexes before the current entry contains a non-null value in "A").

Does anyone knows how to do so?

Thanks!


Solution

  • You can use a combination of shift and forward_fill to get the last non-null value.

    So with your input, this looks like

    df = pl.DataFrame({
        "a": [1, 15, None, 20, None]
    })
    df.with_columns(
        # current row value for "a" minus the last non-null value
        # as the first row has no previous non-null value,
        # fill it with the first value (1)
        (
            pl.col("a") - 
            pl.col("a").shift(fill_value=pl.col("a").first()).forward_fill()
        ).alias("b")
    )
    # shape: (5, 2)
    # ┌──────┬──────┐
    # │ a    ┆ b    │
    # │ ---  ┆ ---  │
    # │ i64  ┆ i64  │
    # ╞══════╪══════╡
    # │ 1    ┆ 0    │
    # │ 15   ┆ 14   │
    # │ null ┆ null │
    # │ 20   ┆ 5    │
    # │ null ┆ null │
    # └──────┴──────┘