Tags: python, pandas, memory-efficient

Which is the fastest way of querying strings in python


Assume you have to query the prices of around 100 unique items (in the example I query just 3). What would be the fastest way (measured in seconds per query) of looking up these items' prices by name using a Python data structure?

Here is an example with a pandas DataFrame (querying just 3 items; in practice I would query around 50 or 100).

import pandas as pd
import time

# 500,000 rows: each of the 5 item names repeated 100,000 times
df = pd.DataFrame({"price": [11, 33, 5, 29, 999] * 100000},
                  index=["car", "boat", "axe", "fork", "plane"] * 100000)

# now query the price of these items:
time_start = time.perf_counter()  # perf_counter is more precise than time.time for timing
query = df.loc[["car", "boat", "plane"]]
time_elapsed = round(time.perf_counter() - time_start, 2)
print(f"[INFO] Time elapsed: {time_elapsed} seconds")
print(query)

Can you think of faster ways than the one I showed? Ideally I'd like a plain Python data structure rather than a database, but I'm open to suggestions about databases like MongoDB (as comments rather than answers).

Thanks!


Solution

  • I doubt you'll get much better than using a mask:

    import pandas as pd
    df = pd.DataFrame({"price": [11, 33, 5, 29, 999] * 100000},
                      index=["car", "boat", "axe", "fork", "plane"] * 100000)
    items = ["car", "boat", "plane"]

    # Each variant copies df first so the copy overhead is the same
    # for all three and the comparison stays fair.
    def v1(df, items):
        # transpose so index labels become columns, select, transpose back
        df = df.copy()
        return df.T[items].T

    def v2(df, items):
        # label-based lookup on the (unsorted, duplicated) index
        df = df.copy()
        return df.loc[items]

    def v3(df, items):
        # boolean mask: a single vectorized membership test over the index
        df = df.copy()
        return df[df.index.isin(items)]

    # %timeit is IPython/Jupyter magic; run these in a notebook or IPython shell
    print('Treating index as columns:')
    %timeit v1(df, items)
    print('\nUsing loc:')
    %timeit v2(df, items)
    print('\nUsing a mask:')
    %timeit v3(df, items)
    

    Output:

    Treating index as columns:
    108 ms ± 8.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    Using loc:
    57 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
    
    Using a mask:
    12.8 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
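
  • Two further options worth knowing about, sketched below under assumptions not in the benchmark above: sorting the index once up front lets `.loc` use binary search instead of a full scan on each query, and if the item names really were unique (the demo data repeats each name 100,000 times, so they are not), a plain dict gives O(1) lookups with no pandas overhead at all. The `unique_prices` dict here is a hypothetical stand-in for a deduplicated price table:

    ```python
    import pandas as pd

    df = pd.DataFrame({"price": [11, 33, 5, 29, 999] * 100000},
                      index=["car", "boat", "axe", "fork", "plane"] * 100000)
    items = ["car", "boat", "plane"]

    # Option A: pay a one-time sort so that repeated .loc queries can
    # use binary search on the monotonic index.
    sorted_df = df.sort_index()
    by_loc = sorted_df.loc[items]

    # Option B (assumes unique names, unlike the demo data): a plain
    # dict lookup is O(1) per item and bypasses pandas entirely.
    unique_prices = dict(zip(["car", "boat", "axe", "fork", "plane"],
                             [11, 33, 5, 29, 999]))
    by_dict = [unique_prices[item] for item in items]

    print(len(by_loc))  # 300000 rows: each queried name occurs 100000 times
    print(by_dict)      # [11, 33, 999]
    ```

    Whether Option A beats the mask depends on data size, the number of queries amortizing the sort, and the pandas version, so time it on your own data.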