Tags: indexing, lookup, python-polars

What is the fastest way to do "indexed" look-ups in Polars?


I am working with large Polars DataFrames that are fully loaded in memory. Each row is uniquely indexed by the columns entityId (Int64) and entryDate (date).

I know Polars does not have indexes, but I still need to do ad-hoc look-ups of data against these tables, and they happen frequently enough that they take a non-trivial percentage of my application's runtime.

Currently I am finding these rows using .filter:

import polars as pl

def locate(df: pl.DataFrame, entityId, entryDate) -> pl.DataFrame:
    return df.filter(pl.col('entityId') == entityId).filter(pl.col('entryDate') == entryDate)

This is fairly fast, but it still takes 50 to 100ms to look up a row.

Are there any optimizations I'm missing?

Some things I tried (rough sketches below):

  1. using .lazy / .collect (no change)
  2. sorting by entityId (no change)

I'm on polars 0.17.12


Solution

  • I'm going to start with

    import numpy as np
    import polars as pl
    from datetime import datetime, timedelta

    size = 1_000_000
    df = pl.DataFrame({'entityId': np.random.randint(0, 200, size=size),
                       'entryDate': [datetime(2022,1,1).date() + timedelta(days=int(np.random.randint(1,200,1)[0])) for _ in range(size)],
                       'blah': np.random.random(size=size)})
    
    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())
    
    1.45 ms ± 14 µs per loop
    

    That's the benchmark.

    The first improvement you can make is to combine your filters instead of chaining them...

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))
    
    996 µs ± 16.6 µs per loop
    

    Now let's try to sort the columns...

    df=df.sort(['entityId', 'entryDate'])
    

    After sorting...

    %%timeit
    df.filter(pl.col('entityId')==34).filter(pl.col('entryDate')==datetime(2022,2,16).date())
    
    1.33 ms ± 23.7 µs per loop
    

    Barely an improvement over the initial 1.45 ms.

    %%timeit
    df.filter((pl.col('entityId')==34) & (pl.col('entryDate')==datetime(2022,2,16).date()))
    
    1.04 ms ± 35.5 µs per loop
    

    Not sure why that took longer on average, although the two overlap within one standard deviation, so they're basically the same.

    Now let's try making a struct column out of the two columns, which we'll call INDEX.

    df=df.select(pl.struct(['entityId', 'entryDate']).alias('INDEX'), pl.exclude(['entityId', 'entryDate'])).sort('INDEX')
    
    
    %%timeit
    df.filter((pl.col('INDEX').struct.field('entityId')==34) & (pl.col('INDEX').struct.field('entryDate')==datetime(2022,2,16).date()))
    
    946 µs ± 18.4 µs per loop
    

    That's a slight improvement.

    Let's try partitioning the df by entityId

    dfdicts=df.unnest('INDEX').partition_by('entityId',as_dict=True)
    
    %%timeit
    dfdicts[34].filter(pl.col('entryDate')==datetime(2022,2,16).date())
    
    445 µs ± 6.51 µs per loop
    

    That's a big improvement!! Let's try partitioning the other way

    dfdicts2=df.unnest('INDEX').partition_by('entryDate',as_dict=True)
    
    %%timeit
    dfdicts2[datetime(2022,2,16).date()].filter(pl.col('entityId')==34)
    
    382 µs ± 7.23 µs per loop
    

    The search_sorted trick only works when you want to get a single row, and in my random data it's been returning multiple rows. However, if you're only looking to get a single row, you could do

    %%timeit
    startrow=dfdicts2[datetime(2022,2,16).date()].get_column('entityId').search_sorted(34, side='left')
    dfdicts2[datetime(2022,2,16).date()].slice(startrow, 1)
    
    333 µs ± 8.31 µs per loop
    

    There's a slight improvement from the search_sorted trick over the regular filter, but not as big as in the answer that trick comes from.

    I'm guessing that filter has been made much more efficient since that answer was posted, which is one reason there's not much of an improvement here. In this case the improvement might just be because only one row is returned.
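
    If you do need all the rows for a key rather than just the first one, the same trick can be extended to a range by searching from both sides. A sketch, still assuming the dfdicts2 partitions are sorted by entityId as above:

    part = dfdicts2[datetime(2022,2,16).date()]
    ids = part.get_column('entityId')
    start = ids.search_sorted(34, side='left')   # first row with entityId == 34
    end = ids.search_sorted(34, side='right')    # one past the last such row
    part.slice(start, end - start)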

    TL;DR

    If you've got enough memory, partition by one of the columns; try both, since the partition sizes will affect which gives better performance. The search_sorted trick only works on numbers, not dates, so it's only available if you partition by the date field. If you don't have the memory for partition_by, then combining your columns into a sorted struct and keeping both filter expressions in a single call to filter should be the next best option. You can easily unnest('INDEX') to get the original shape back.
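
    For reference, a locate() built on the sorted-struct version might look like this (just a sketch; locate_struct is a name I'm making up here):

    def locate_struct(df, entityId, entryDate):
        # assumes df already has the sorted struct column 'INDEX' from above
        return df.filter(
            (pl.col('INDEX').struct.field('entityId') == entityId)
            & (pl.col('INDEX').struct.field('entryDate') == entryDate)
        ).unnest('INDEX')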

    As an aside, if you can refactor whatever you're doing downstream into a groupby.agg and/or a join, that would probably be much better, in particular because of parallelization.
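
    For instance, if the downstream code really just needs many of these single-row look-ups, one join against a small frame of the keys would do them all in a single parallel pass. A sketch, where lookups is a hypothetical frame of the (entityId, entryDate) pairs you want:

    lookups = pl.DataFrame({'entityId': [34, 57],
                            'entryDate': [datetime(2022,2,16).date(), datetime(2022,3,1).date()]})

    lookups.join(df.unnest('INDEX'), on=['entityId', 'entryDate'], how='left')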