Assume you have to query the prices of around 100 unique items. What would be the fastest way (measured in seconds per query) to look up the prices of these items by name using a Python data structure?
Here is an example using a pandas DataFrame (querying just 3 items; in practice I would query around 50 or 100).
import pandas as pd
import time
df = pd.DataFrame({"price":[11,33,5,29,999]*100000},index=["car","boat","axe","fork","plane"]*100000)
# now query the price of these items:
time_start = time.time()
query = df.loc[["car","boat","plane"]]
time_elapsed = round(time.time()-time_start,2)
print(f"[INFO] Time elapsed: {time_elapsed} seconds")
print(query)
Can you think of faster ways than the one I showed? Ideally I would prefer a plain data structure rather than a database, but I'm open to suggestions about databases like MongoDB (as comments, not as answers).
Thanks!
I doubt you'll get much better than using a mask:
import pandas as pd
df = pd.DataFrame({"price":[11,33,5,29,999]*100000},index=["car","boat","axe","fork","plane"]*100000)
items = ["car","boat","plane"]
def v1(df, items):
    # transpose so labels become columns, select, transpose back
    df = df.copy()
    return df.T[items].T

def v2(df, items):
    # label-based selection with .loc
    df = df.copy()
    return df.loc[items]

def v3(df, items):
    # boolean mask built with Index.isin
    df = df.copy()
    return df[df.index.isin(items)]
print('Treating index as columns:')
%timeit v1(df, items)
print('\nUsing loc:')
%timeit v2(df, items)
print('\nUsing a mask:')
%timeit v3(df, items)
Output:
Treating index as columns:
108 ms ± 8.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using loc:
57 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using a mask:
12.8 ms ± 305 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
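As a side note, since the question explicitly asks about Python data structures: if each item name really is unique (as stated), a plain dict gives O(1) hash lookups and should beat any of the DataFrame approaches above. A minimal sketch, assuming a small hand-built price table with the same toy values:

```python
import time

# Assumption: each item name appears exactly once, as the question states.
prices = {"car": 11, "boat": 33, "axe": 5, "fork": 29, "plane": 999}
items = ["car", "boat", "plane"]

start = time.perf_counter()
# one hash lookup per requested item
result = {item: prices[item] for item in items}
elapsed = time.perf_counter() - start

print(result)   # {'car': 11, 'boat': 33, 'plane': 999}
print(elapsed)  # microseconds, not milliseconds
```

Note that this only makes sense if the index is unique; with the 100000-fold duplicated index in the example, each label maps to many rows and the mask approach remains the appropriate tool.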