Given a polars DataFrame
data = pl.DataFrame({"user_id": [1, 1, 1, 2, 2, 2], "login": [False, True, False, False, False, True]})
How could I add a column which adds the number of rows since the user last logged in, with any rows before a login for that user being set to None?
Example output for the above data is
[None, 0, 1, None, None, 0]
I tried adapting the answers from here, but cannot get it working with groups
You could subtract the row number of the previous login.
(df.with_row_index()
.with_columns(distance =
pl.col.index - pl.when("login").then("index").forward_fill().over("user_id")
)
)
shape: (6, 4)
┌───────┬─────────┬───────┬──────────┐
│ index ┆ user_id ┆ login ┆ distance │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ bool ┆ u32 │
╞═══════╪═════════╪═══════╪══════════╡
│ 0 ┆ 1 ┆ false ┆ null │
│ 1 ┆ 1 ┆ true ┆ 0 │
│ 2 ┆ 1 ┆ false ┆ 1 │
│ 3 ┆ 2 ┆ false ┆ null │
│ 4 ┆ 2 ┆ false ┆ null │
│ 5 ┆ 2 ┆ true ┆ 0 │
└───────┴─────────┴───────┴──────────┘
If the data is unsorted, you can use .int_range()
as the row number instead.
df.with_columns(distance =
pl.int_range(pl.len()).over("user_id")
- pl.when("login").then(pl.int_range(pl.len())).forward_fill().over("user_id")
)
(df.with_row_index()
.with_columns(
new_index = pl.when("login").then("index"),
last_index = pl.when("login").then("index").forward_fill(),
last_index_groupwise = pl.when("login").then("index").forward_fill().over("user_id")
)
)
shape: (6, 6)
┌───────┬─────────┬───────┬───────────┬────────────┬──────────────────────┐
│ index ┆ user_id ┆ login ┆ new_index ┆ last_index ┆ last_index_groupwise │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ bool ┆ u32 ┆ u32 ┆ u32 │
╞═══════╪═════════╪═══════╪═══════════╪════════════╪══════════════════════╡
│ 0 ┆ 1 ┆ false ┆ null ┆ null ┆ null │
│ 1 ┆ 1 ┆ true ┆ 1 ┆ 1 ┆ 1 │
│ 2 ┆ 1 ┆ false ┆ null ┆ 1 ┆ 1 │
│ 3 ┆ 2 ┆ false ┆ null ┆ 1 ┆ null │
│ 4 ┆ 2 ┆ false ┆ null ┆ 1 ┆ null │
│ 5 ┆ 2 ┆ true ┆ 5 ┆ 5 ┆ 5 │
└───────┴─────────┴───────┴───────────┴────────────┴──────────────────────┘
when/then
to create the null/non-null sequence.forward_fill()
to bring the previous value forward.over()
to make it a groupwise operation