Tags: indexing, lookup, python-polars

What is the fastest way to do "indexed" look-ups in Polars?


I am working with large Polars DataFrames that are fully loaded in memory. Each row is uniquely indexed by the columns entityId (Int64) and entryDate (date).

I know Polars does not have indexes, but I still need to do ad-hoc look-ups of data against these tables, and they happen frequently enough that they take a non-trivial percentage of my application's runtime.

Currently I am finding these rows using .filter:

import polars as pl

def locate(df: pl.DataFrame, entityId, entryDate) -> pl.DataFrame:
    return df.filter(pl.col('entityId') == entityId).filter(pl.col('entryDate') == entryDate)

This is fairly fast, but it still takes 50 to 100ms to look up a row.

Are there any optimizations I'm missing?

Some things I tried (rough sketches below):

  1. using .lazy / .collect (no change)
  2. sorting by entityId (no change)

I'm on polars 0.17.12


Solution

  • I'm going to start with

    import numpy as np
    import polars as pl
    from datetime import datetime, timedelta

    size = 1_000_000
    df = pl.DataFrame({'entityId': np.random.randint(0, 200, size=size),
                       'entryDate': [datetime(2022,1,1).date() + timedelta(days=int(np.random.randint(1,200,1)[0])) for _ in range(size)],
                       'blah': np.random.random(size=size)})
    
    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())
    
    1.45 ms ± 14 µs per loop
    

    That's the benchmark.

    The first improvement you can make is to combine your filters instead of chaining them...

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))
    
    996 µs ± 16.6 µs per loop
    

    Now let's try to sort the columns...

    df=df.sort(['entityId', 'entryDate'])
    

    After sorting...

    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())
    
    1.33 ms ± 23.7 µs per loop
    

    Barely an improvement over the initial 1.45 ms.

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))
    
    1.04 ms ± 35.5 µs per loop
    

    Not sure why that took longer on average, although the two overlap within one standard deviation, so they're basically the same.

    Now let's try making a struct column out of the two columns, which we'll call INDEX.

    df=df.select(pl.struct(['entityId', 'entryDate']).alias('INDEX'), pl.exclude(['entityId', 'entryDate'])).sort('INDEX')
    
    
    %%timeit
    df.filter((pl.col('INDEX').struct.field('entityId')==34) & (pl.col('INDEX').struct.field('entryDate')==datetime(2022,2,16).date()))
    
    946 µs ± 18.4 µs per loop
    

    That's a slight improvement.

    Let's try partitioning the df by entityId

    dfdicts=df.unnest('INDEX').partition_by('entityId',as_dict=True)
    
    %%timeit
    dfdicts[34].filter(pl.col('entryDate')==datetime(2022,2,16).date())
    
    445 µs ± 6.51 µs per loop
    

    That's a big improvement!! Let's try partitioning the other way

    dfdicts2=df.unnest('INDEX').partition_by('entryDate',as_dict=True)
    
    %%timeit
    dfdicts2[datetime(2022,2,16).date()].filter(pl.col('entityId')==34)
    
    382 µs ± 7.23 µs per loop
    

    The search_sorted trick only works when you want to get a single row, and in my random data it's been returning multiple rows. However, if you're only looking to get a single row, you could do

    %%timeit
    startrow=dfdicts2[datetime(2022,2,16).date()].get_column('entityId').search_sorted(34, side='left')
    dfdicts2[datetime(2022,2,16).date()].slice(startrow, 1)
    
    333 µs ± 8.31 µs per loop
    

    There's a slight improvement from the search_sorted trick over the regular filter, but not as big as in the answer that trick comes from.

    I'm guessing that filter has been made much more efficient since that answer was posted, which is one reason there's not much of an improvement here. In this case the improvement might just be because only one row is returned.
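
    If you do need all the rows for a key rather than just the first one, the same trick can be extended to a range by searching from both sides. A sketch, still assuming the dfdicts2 partitions are sorted by entityId as above:

    part = dfdicts2[datetime(2022,2,16).date()]
    ids = part.get_column('entityId')
    start = ids.search_sorted(34, side='left')   # first row with entityId == 34
    end = ids.search_sorted(34, side='right')    # one past the last such row
    part.slice(start, end - start)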

    TL;DR

    If you've got enough memory, partition by one of the columns; try both, since the partition sizes will affect which gives better performance. The search_sorted trick only works on numbers, not dates, so it's only available if you partition by the date field. If you don't have the memory for partition_by, then combining your columns into a sorted struct and keeping both filter expressions in a single call to filter should be the next best option. You can easily unnest('INDEX') to get the original shape back.
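
    For reference, a locate() built on the sorted-struct version might look like this (just a sketch; locate_struct is a name I'm making up here):

    def locate_struct(df, entityId, entryDate):
        # assumes df already has the sorted struct column 'INDEX' from above
        return df.filter(
            (pl.col('INDEX').struct.field('entityId') == entityId)
            & (pl.col('INDEX').struct.field('entryDate') == entryDate)
        ).unnest('INDEX')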

    As an aside, if you can refactor whatever you're doing downstream into a groupby.agg and/or a join, that would probably be much better, in particular because of parallelization.
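
    For instance, if the downstream code really just needs many of these single-row look-ups, one join against a small frame of the keys would do them all in a single parallel pass. A sketch, where lookups is a hypothetical frame of the (entityId, entryDate) pairs you want:

    lookups = pl.DataFrame({'entityId': [34, 57],
                            'entryDate': [datetime(2022,2,16).date(), datetime(2022,3,1).date()]})

    lookups.join(df.unnest('INDEX'), on=['entityId', 'entryDate'], how='left')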