I am working with large polars dataframes which are fully loaded in memory. Each row is uniquely indexed by columns entityId (Int64) and entryDate (date).
I know polars does not have indexes, but I still need to do ad-hoc look-ups of data against these tables, and it happens frequently enough that it takes a non-trivial percentage of my application's runtime.
Currently I am finding these rows using .filter:
def locate(df, entityId, entryDate) -> pl.DataFrame:
    return df.filter(pl.col('entityId') == entityId).filter(pl.col('entryDate') == entryDate)
This is fairly fast, but it still takes 50 to 100ms to look up a row.
Are there any optimizations I'm missing?
Some things I tried:
.lazy / .collect (no change)
I'm on polars 0.17.12
I'm going to start with
import numpy as np
import polars as pl
from datetime import datetime, timedelta

size = 1_000_000
df = pl.DataFrame({'entityId': np.random.randint(0, 200, size=size),
                   'entryDate': [datetime(2022, 1, 1).date() + timedelta(days=int(np.random.randint(1, 200, 1)[0])) for _ in range(size)],
                   'blah': np.random.random(size=size)})
%%timeit
df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())
1.45 ms ± 14 µs per loop
That's the benchmark.
The first improvement you can make is to combine your filters instead of chaining them...
%%timeit
df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))
996 µs ± 16.6 µs per loop
Now let's try to sort the columns...
df=df.sort(['entityId', 'entryDate'])
After sorting...
%%timeit
df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())
1.33 ms ± 23.7 µs per loop
Barely an improvement over the initial 1.45 ms.
%%timeit
df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))
1.04 ms ± 35.5 µs per loop
Not sure why that took longer on average, although the two overlap within one standard deviation, so it's basically the same.
Now let's try making a struct column out of the two columns, which we'll call INDEX...
df=df.select(pl.struct(['entityId', 'entryDate']).alias('INDEX'), pl.exclude(['entityId', 'entryDate'])).sort('INDEX')
%%timeit
df.filter((pl.col('INDEX').struct.field('entityId')==34) & (pl.col('INDEX').struct.field('entryDate')==datetime(2022,2,16).date()))
946 µs ± 18.4 µs per loop
That's a slight improvement.
Let's try partitioning the df by entityId
dfdicts=df.unnest('INDEX').partition_by('entityId',as_dict=True)
%%timeit
dfdicts[34].filter(pl.col('entryDate')==datetime(2022,2,16).date())
445 µs ± 6.51 µs per loop
That's a big improvement!! Let's try partitioning the other way
dfdicts2=df.unnest('INDEX').partition_by('entryDate',as_dict=True)
%%timeit
dfdicts2[datetime(2022,2,16).date()].filter(pl.col('entityId')==34)
382 µs ± 7.23 µs per loop
The search_sorted trick only works when you want to get a single row, and in my random data it's been returning multiple rows. However, if you're only looking to get a single row you could do
%%timeit
startrow=dfdicts2[datetime(2022,2,16).date()].get_column('entityId').search_sorted(34, side='left')
dfdicts2[datetime(2022,2,16).date()].slice(startrow, 1)
333 µs ± 8.31 µs per loop
There's a slight improvement from the search_sorted trick over the regular filter, but not as big as here. I'm guessing that since that answer was posted, filter was made much more efficient, so that's one reason why there's not much of an improvement. In this case the improvement might just be because it's only returning one row.
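If you do need every matching row rather than just one, the same idea should still work with two binary searches, since each date partition is sorted by entityId after the earlier sort. A rough sketch I haven't benchmarked, assuming search_sorted accepts side='right' as well as side='left':

part = dfdicts2[datetime(2022, 2, 16).date()]   # this partition is sorted by entityId
ids = part.get_column('entityId')
start = ids.search_sorted(34, side='left')      # first row with entityId == 34
end = ids.search_sorted(34, side='right')       # one past the last matching row
part.slice(start, end - start)                  # all rows for that (entityId, entryDate)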
TL;DR
If you've got enough memory, then partition by one of the columns; try both, since the size of each partition will affect which gives better performance. The search_sorted trick only works on numbers, not dates, so it's only available if you partition by the date field. If you don't have the memory for partition_by, then combining your columns into a sorted struct and keeping both filter expressions in one call to filter should be the next best option. You can easily unnest('INDEX') to get the original shape back.
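To put the partition approach back into the shape of the locate function from the question, a minimal sketch (untested; it assumes the date-keyed dfdicts2 built above, where on this polars version each dict key is the bare date value):

def locate(dfdicts2, entityId, entryDate) -> pl.DataFrame:
    part = dfdicts2.get(entryDate)            # grab the partition for that date
    if part is None:                          # no rows at all for that date
        return pl.DataFrame()
    return part.filter(pl.col('entityId') == entityId)

locate(dfdicts2, 34, datetime(2022, 2, 16).date())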
As an aside, if you can refactor whatever you're doing downstream into a groupby.agg and/or a join then that would probably be much better, in particular because of parallelization.
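For example, if the downstream work is really "look up many (entityId, entryDate) pairs", a single join against a small frame of keys replaces all the per-row filters. A rough sketch, where the keys frame and its values are made up purely for illustration (df here is the struct-INDEX version from above, hence the unnest):

keys = pl.DataFrame({
    'entityId': [34, 56],
    'entryDate': [datetime(2022, 2, 16).date(), datetime(2022, 3, 1).date()],
})
keys.join(df.unnest('INDEX'), on=['entityId', 'entryDate'], how='left')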