Search code examples
pythonpandaspython-polars

pandas or Polars: find index of previous element larger than current one


Suppose my data looks like this:

data = {
    'value': [1,9,6,7,3, 2,4,5,1,9]
}

For each row, I would like to find the row number of the latest previous element larger than the current one.

So, my expected output is:

[None, 0, 1, 2, 1, 1, 3, 4, 1, 0]
  • the first element 1 has no previous element, so I want None in the result
  • the next element 9 is at least as large than all its previous elements, so I want 0 in the result
  • the next element 6, has its previous element 9 which is larger than it. The distance between them is 1. So, I want 1 in the result here.

I'm aware that I can do this in a loop in Python (or in C / Rust if I write an extension).

My question: is it possible to solve this using entirely dataframe operations? pandas or Polars, either is fine. But only dataframe operations.

So, none of the following please:

  • apply
  • map_elements
  • map_rows
  • iter_rows
  • Python for loops which loop over the rows and extract elements one-by-one from the dataframes

Solution

  • This iterates only on the range of rows that this should look. It doesn't loop over the rows themselves in python. If your initial bound_range covers all the cases then it won't ever actually do a loop.

    lb=0
    bound_range=3
    df=df.with_columns(z=pl.lit(None, dtype=pl.UInt64))
    while True:
        df=df.with_columns(
            z=pl.when(pl.col('value')>=pl.col('value').shift(1).cum_max())
                .then(pl.lit(0, dtype=pl.UInt64))
                .when(pl.col('z').is_null())
                .then(
                    pl.coalesce(
                        pl.when(pl.col('value')<pl.col('value').shift(x))
                            .then(pl.lit(x, dtype=pl.UInt64))
                            for x in range(lb, lb+bound_range)
                    )
                )
                .otherwise(pl.col('z'))
                )
        if df[1:]['z'].drop_nulls().shape[0]==df.shape[0]-1:
            break
        lb+=bound_range
    
    

    For this example I set bound_range to 3 to make sure it loops at least once. I ran this with 1M random integers between 0 and 9(inclusive) and I set the bound_range to 50 and it took under 2 sec. You could make this smarter in between loops by checking things more explicitly but the best approach there would be data dependent.