Suppose my data looks like this:
data = {
'value': [1,9,6,7,3, 2,4,5,1,9]
}
For each row, I would like to find the row number of the latest previous element larger than the current one.
So, my expected output is:
[None, 0, 1, 2, 1, 1, 3, 4, 1, 0]
1
has no previous element, so I want None
in the result9
is at least as large than all its previous elements, so I want 0
in the result6
, has its previous element 9
which is larger than it. The distance between them is 1
. So, I want 1
in the result here.I'm aware that I can do this in a loop in Python (or in C / Rust if I write an extension).
My question: is it possible to solve this using entirely dataframe operations? pandas or Polars, either is fine. But only dataframe operations.
So, none of the following please:
apply
map_elements
map_rows
iter_rows
This iterates only on the range of rows that this should look. It doesn't loop over the rows themselves in python. If your initial bound_range
covers all the cases then it won't ever actually do a loop.
lb=0
bound_range=3
df=df.with_columns(z=pl.lit(None, dtype=pl.UInt64))
while True:
df=df.with_columns(
z=pl.when(pl.col('value')>=pl.col('value').shift(1).cum_max())
.then(pl.lit(0, dtype=pl.UInt64))
.when(pl.col('z').is_null())
.then(
pl.coalesce(
pl.when(pl.col('value')<pl.col('value').shift(x))
.then(pl.lit(x, dtype=pl.UInt64))
for x in range(lb, lb+bound_range)
)
)
.otherwise(pl.col('z'))
)
if df[1:]['z'].drop_nulls().shape[0]==df.shape[0]-1:
break
lb+=bound_range
For this example I set bound_range
to 3 to make sure it loops at least once. I ran this with 1M random integers between 0 and 9(inclusive) and I set the bound_range to 50 and it took under 2 sec. You could make this smarter in between loops by checking things more explicitly but the best approach there would be data dependent.