`searchsorted` is an incredibly useful utility in numpy and pandas for performing a binary search on every element in a list, especially for time-series data.
```python
import numpy as np

np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c'])                # Returns [0, 2, 3]
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c'], side='right')  # Returns [2, 3, 4]
```
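For reference, pandas exposes the same routine on its `Index` and `Series` objects, with the same `side` keyword:

```python
import pandas as pd

# Same binary search, same side= semantics as numpy:
pd.Index(['a', 'a', 'b', 'c']).searchsorted(['a', 'b', 'c'], side='right')  # array([2, 3, 4])
```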
I have a few questions about Polars:

- Is there any way to apply `search_sorted` on a list in Polars in a vectorized manner?
- Is there any way to specify `side='right'` for `search_sorted`?
- Can we use non-numeric data in `search_sorted`?

If the answer to any of these is no, what would be the recommended approach / workaround to achieve the same functionality?
`search_sorted` can be used as part of an expression, e.g. `pl.col('A').search_sorted(pl.col('B'))`.

Here's what I have tried:
```python
import polars as pl

pl.Series(['a', 'a', 'b', 'c']).search_sorted(['a', 'b', 'c'])  # PanicException: not implemented for Utf8
pl.Series([0, 0, 1, 2]).search_sorted([0, 1, 2])                # PanicException: dtype List not implemented
list(map(pl.Series([0, 0, 1, 2]).search_sorted, [0, 1, 2]))     # Returns [1, 2, 3], different from numpy's [0, 2, 3]
pl.DataFrame({
    'a': [0, 0, 1, 2],
    'b': [0, 1, 2, 3],
}).with_columns([
    pl.col('a').search_sorted(pl.col('b')).alias('c')
])  # Column 'c' is [1, 1, 1, 1], which is incorrect
```
I understand Polars is still a work in progress and some functionalities are missing, so any help is greatly appreciated!
To extend on @ritchie46's answer, you need a rolling join so that missing values can be joined to their nearest neighbor. Unfortunately, rolling joins don't work on letters, or more accurately on `String` dtypes, so for your example you have to do an extra step.
Starting from:
```python
import polars as pl

df1 = (pl.Series("a", ["a", "a", "b", "c"])
       .set_sorted()
       .to_frame()
       .with_row_index("idx"))
df2 = pl.Series("a", ["a", "b", "c"]).set_sorted().to_frame()
```
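Just to make the setup concrete, a quick sanity check of what the two frames hold at this point:

```python
# df1: positions 0..3 alongside the sorted letters; df2: the search values.
assert df1.get_column("idx").to_list() == [0, 1, 2, 3]
assert df1.get_column("a").to_list() == ["a", "a", "b", "c"]
assert df2.get_column("a").to_list() == ["a", "b", "c"]
```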
Then we make a DataFrame to house all the possible values of `a` and map each one to a numeric index:

```python
dfindx = (pl.DataFrame(pl.concat([df1.get_column('a'), df2.get_column('a')]).unique())
          .sort('a')
          .with_row_index('valindx'))
```
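This gives each distinct letter an ordinal position:

```python
# dfindx: one row per distinct value, keyed 0, 1, 2 in sorted order.
assert dfindx.rows() == [(0, "a"), (1, "b"), (2, "c")]
```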
Now we add that `valindx` to each of `df1` and `df2`:

```python
df1 = df1.join(dfindx, on='a')
df2 = df2.join(dfindx, on='a')
```
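Assuming the join keeps the left frame's row order (which it does here in practice), both frames now carry the numeric key:

```python
# Both frames now have a sorted numeric 'valindx' column to asof-join on.
assert df1.get_column("valindx").to_list() == [0, 0, 1, 2]
assert df2.get_column("valindx").to_list() == [0, 1, 2]
```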
To get almost to the finish line you'd do:

```python
df2.join_asof(df1, on='valindx', strategy='forward')
```
This will leave the last value missing (the 4 from the numpy `side='right'` case). Essentially what's happening is that the first value `'a'` doesn't find a match, but its nearest forward neighbor is a `'b'`, so it takes that row's `idx`, and so on; when it gets to `'c'` there is nothing in `df1` forward of it, so we need to do a minor hack of just filling in that null with the max `idx` + 1.
```python
(df2
 .join_asof(df1, on='valindx', strategy='forward')
 .with_columns(pl.col('idx').fill_null(df1.select(pl.col('idx').max() + 1).item()))
 .get_column('idx'))
```
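This reproduces the numpy `side='right'` result from the question:

```python
import numpy as np

# The asof-join route matches numpy's side='right' output:
np.searchsorted(['a', 'a', 'b', 'c'], ['a', 'b', 'c'], side='right')  # array([2, 3, 4])
```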
Of course, if you're using times or numerics then you can skip the first step. Additionally, I suspect that fetching this index value is only an intermediate step and that the overall process could be done more efficiently without extracting the index values at all, but still through a `join_asof`.
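For instance, a minimal sketch of what I mean, where the `val` payload column is hypothetical and not part of the question:

```python
# Hypothetical: if df1 carried a payload column, join_asof can attach it to df2
# directly, with no need to materialize the integer positions at all.
df1v = df1.with_columns(pl.Series("val", [10.0, 20.0, 30.0, 40.0]))  # assumed payload
df2.join_asof(df1v, on="valindx", strategy="forward").select(["a", "val"])
```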
If you change the `strategy` of `join_asof` then that should be largely the same as switching the `side`, but you'd have to change the hack bit at the end too.
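For example, a rough sketch of the `'backward'` variant: the nulls now show up for values below the smallest key in `df1`, so filling with 0 (the floor that `searchsorted` would return) plays the role of the max-`idx` hack; whether this exactly mirrors `side='left'` is something you'd want to verify on your own data:

```python
# 'backward' takes the nearest neighbor at or before each key; values smaller
# than everything in df1 come back null, so fill those with 0 instead.
(df2
 .join_asof(df1, on='valindx', strategy='backward')
 .with_columns(pl.col('idx').fill_null(0))
 .get_column('idx'))
```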