python-polars

Polars searchsorted with a Series


searchsorted is an incredibly useful utility in NumPy and pandas for performing a binary search for every element of one list within another sorted list, which is especially handy for time-series data.

import numpy as np

np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c']) # Returns [0, 2, 3]
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c'], side='right') # Returns [2, 3, 4]

I have a few questions about Polars:

  1. Is there any way to apply search_sorted on a list in polars in a vectorized manner?

  2. Is there any way to specify side=right for search_sorted?

  3. Can we use non-numeric data in search_sorted?

  4. If the answer is no to any of these, what would be the recommended approach / workaround to achieve the same functionality?

    • (The ideal approach is if search_sorted can be used as part of an expression, e.g. pl.col('A').search_sorted(pl.col('B')))

Here's what I have tried:

import polars as pl

pl.Series(['a', 'a', 'b', 'c']).search_sorted(['a', 'b', 'c']) # PanicException: not implemented for Utf8

pl.Series([0, 0, 1, 2]).search_sorted([0, 1, 2]) # PanicException: dtype List not implemented

list(map(pl.Series([0, 0, 1, 2]).search_sorted, [0, 1, 2])) # Returns [1, 2, 3], different from numpy results

pl.DataFrame({
    'a': [0, 0, 1, 2],
    'b': [0, 1, 2, 3],
}).with_columns([
    pl.col('a').search_sorted(pl.col('b')).alias('c')
]) # Column c is [1, 1, 1, 1], which is incorrect

I understand Polars is still a work in progress and some functionalities are missing, so any help is greatly appreciated!


Solution

  • To extend @ritchie46's answer, you need a rolling join so that missing values can be joined to their nearest neighbor. Unfortunately, rolling joins don't work on letters, or more accurately the String dtype, so for your example you have to do an extra step.

    Starting from:

    df1 = (pl.Series("a", ["a", "a", "b", "c"])
        .set_sorted()
        .to_frame()
        .with_row_index("idx"))
    
    df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
    

    then we make a df to house all the possible values of a and map them to a numeric.

    dfindx = (pl.DataFrame(pl.concat([df1.get_column('a'), df2.get_column('a')]).unique())
              .sort('a')
              .with_row_index('valindx'))
    

    now we add that valindx to each of df1 and df2

    df1=df1.join(dfindx, on='a')
    df2=df2.join(dfindx, on='a')
    

    To get almost to the finish line you'd do:

    df2.join_asof(df1, on='valindx', strategy='forward')
    

    this gets the left-side insertion points: with strategy='forward', join_asof matches each value in df2 to the first row of df1 whose key is greater than or equal to it, which is exactly where numpy would insert on the left. The one gap is any value in df2 that is greater than everything in df1: it has no forward neighbor, so the join leaves a null where numpy would return the length of the array. So we need the minor hack of filling that null with the max idx + 1.

    (df2
        .join_asof(df1, on='valindx', strategy='forward')
        .with_columns(pl.col('idx').fill_null(df1.select(pl.col('idx').max() + 1).item()))
        .get_column('idx'))
    

    Of course, if you're using times or numerics then you can skip the mapping step. Additionally, I suspect that fetching this index value is usually just an intermediate step, and the overall process would be more efficient without extracting the index values at all, doing everything through a join_asof.

    If you change the strategy of join_asof to 'backward', that should be largely the same as switching to side='right', but you'd have to change the hack bit at the end too.