Tags: python, pandas, dataframe, performance, isin

Is .isin() faster than .query()?


Question:

Hi,

While searching for ways to select rows from a dataframe (being relatively inexperienced with Pandas), I ran into the following question:

What is faster for large datasets - .isin() or .query()?

.query() is somewhat more intuitive to read, so it is my preferred approach given my line of work. However, when testing it on a very small example dataset, query seems to be much slower.

Is there anyone who has tested this properly before? If so, what were the outcomes? I searched the web, but could not find another post on this.

See the sample code below, which works for Python 3.8.5.

Thanks a lot in advance for your help!

Code:
# Packages
import pandas as pd
import timeit
import numpy as np


# Create dataframe
df = pd.DataFrame({'name': ['Foo', 'Bar', 'Faz'],
                   'owner': ['Canyon', 'Endurace', 'Bike']},
                  index=['Frame', 'Type', 'Kind'])

# Show dataframe
df

# Create filter
selection = ['Canyon']

# Filter dataframe using 'isin' (type 1)
df_filtered = df[df['owner'].isin(selection)] 

%timeit df_filtered = df[df['owner'].isin(selection)]
213 µs ± 14 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


# Filter dataframe using np.isin on the underlying NumPy array (type 2)
df[np.isin(df['owner'].values, selection)]

%timeit df_filtered = df[np.isin(df['owner'].values, selection)]
128 µs ± 3.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


# Filter dataframe using 'query'
df_filtered = df.query("owner in @selection")

%timeit df_filtered = df.query("owner in @selection")
1.15 ms ± 9.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
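
Outside IPython the %timeit magic is not available, so the same comparison can also be run with the timeit module that is already imported above. This is a minimal sketch, not part of the original post; the repetition count n and the printed summary are my own choices.

# Same comparison with the timeit module instead of the %timeit magic
n = 10_000  # arbitrary number of repetitions for this tiny frame

t_isin  = timeit.timeit(lambda: df[df['owner'].isin(selection)], number=n)
t_np    = timeit.timeit(lambda: df[np.isin(df['owner'].values, selection)], number=n)
t_query = timeit.timeit(lambda: df.query("owner in @selection"), number=n)

print(f"isin:    {t_isin / n * 1e6:.1f} µs per call")
print(f"np.isin: {t_np / n * 1e6:.1f} µs per call")
print(f"query:   {t_query / n * 1e6:.1f} µs per call")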

Solution

  • The best test is on real data; here is a quick comparison for 3k, 300k and 3M rows built from this sample data (each test starts again from the original 3-row frame):

    selection = ['Hedge']
    
    df = pd.concat([df] * 1000, ignore_index=True)
    In [139]: %timeit df[df['owner'].isin(selection)]
    449 µs ± 58 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    In [140]: %timeit df.query("owner in @selection")
    1.57 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    df = pd.concat([df] * 100000, ignore_index=True)
    In [142]: %timeit df[df['owner'].isin(selection)]
    8.25 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    In [143]: %timeit df.query("owner in @selection")
    13 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    df = pd.concat([df] * 1000000, ignore_index=True)
    In [145]: %timeit df[df['owner'].isin(selection)]
    94.5 ms ± 9.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    In [146]: %timeit df.query("owner in @selection")
    112 ms ± 499 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
        
    

    If you check the docs:

    DataFrame.query() using numexpr is slightly faster than Python for large frames

    Conclusion - the best test is on your real data, because the outcome depends on the number of rows, the number of matched values, and also the length of the selection list (see the sketch below).
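
Following that conclusion, the comparison can be repeated on whatever frame you actually work with. Below is a minimal sketch, not from the original answer: the names big, sel and n, the 100000 factor and the extra engine='python' timing are my own choices, assuming df is still the original 3-row sample frame.

# Rebuild a larger frame from the 3-row sample and compare again;
# swap 'big' for your own DataFrame to test on real data.
big = pd.concat([df] * 100000, ignore_index=True)   # ~300k rows
sel = ['Canyon']
n = 100  # arbitrary number of repetitions

t_isin   = timeit.timeit(lambda: big[big['owner'].isin(sel)], number=n)
t_np     = timeit.timeit(lambda: big[np.isin(big['owner'].values, sel)], number=n)
t_query  = timeit.timeit(lambda: big.query("owner in @sel"), number=n)
# Force the pure-Python engine for comparison with the default engine mentioned
# in the docs (numexpr, which is only used when it is installed).
t_python = timeit.timeit(lambda: big.query("owner in @sel", engine='python'), number=n)

for name, t in [('isin', t_isin), ('np.isin', t_np),
                ('query', t_query), ("query (engine='python')", t_python)]:
    print(f"{name:25s} {t / n * 1e3:.2f} ms per call")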