I have the following dataframe:
df = pl.DataFrame({
"Column A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Column B": [2, 3, 1, 4, 1, 7, 3, 2, 12, 0]
})
I want to create a new column C that holds the distance, in rows, between the B value of the current row and the next value in column B that is greater than or equal to B + 50%.
The end result should look like this:
df = pl.DataFrame({
"Column A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
"Column B": [2, 3, 1, 4, 1, 7, 3, 2, 12, 0],
"Column C": [1, 4, 1, 2, 1, 3, 2, 1, None, None]
})
How can I efficiently achieve this using Polars, especially since I'm working with a large DataFrame?
Ok, so first I should say - this one looks like it requires join on inequality on multiple columns and from what I've found pure polars is not great with it. It's probably possible to do it with join_asof
but I couldn't make it pretty.
I'd probably use duckdb
integration with polars to achieve the results:
import duckdb
duckdb.sql("""
select
d."Column A",
d."Column B",
(
select tt."Column A"
from df as tt
where tt."Column A" > d."Column A" and tt."Column B" >= d."Column B" * 1.5
order by tt."Column A" asc
limit 1
) - d."Column A" as "Column C"
from df as d
""").pl()
┌──────────┬──────────┬──────────┐
│ Column A ┆ Column B ┆ Column C │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════════╪══════════╪══════════╡
│ 1 ┆ 2 ┆ 1 │
│ 2 ┆ 3 ┆ 4 │
│ 3 ┆ 1 ┆ 1 │
│ 4 ┆ 4 ┆ 2 │
│ 5 ┆ 1 ┆ 1 │
│ 6 ┆ 7 ┆ 3 │
│ 7 ┆ 3 ┆ 2 │
│ 8 ┆ 2 ┆ 1 │
│ 9 ┆ 12 ┆ null │
│ 10 ┆ 0 ┆ null │
└──────────┴──────────┴──────────┘