I have a large Pandas DataFrame (~800M rows), which I have indexed on a MultiIndex with two levels, an int and a date. I want to retrieve a subset of the DataFrame's rows based on a list of ints (about 10k) that I have. The ints match the first level of the MultiIndex. The MultiIndex is unique.
The first thing I tried was to sort the index and then query it using loc:
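import pandas as pd
df = get_my_df() # 800M rows
ids = [...] # 10k ints, sorted list
# Build the two-level index, keeping the original columns as well
df.set_index(["int_idx", "date_idx"], inplace=True, drop=False)
df.sort_index(inplace=True)
idx = pd.IndexSlice
# Select every date for the requested ints
res = df.loc[idx[ids, :]]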
However this is painfully slow, and I stopped running the code after about an hour.
The next thing I tried was to set only the first column as the index. This is suboptimal for me because the index is then not unique, and later I'll also need to filter further by date:
df.set_index("int_idx", inplace=True, drop=False)
df.sort_index(inplace=True)
idx = pd.IndexSlice
res = df.loc[idx[ids, :]]
To my surprise this was an improvement, but still very slow.
I have two questions:
Retrieving a subset of a DataFrame with 800M rows can be slow. Here are some ideas that may speed up the selection:
Use boolean indexing with .loc instead of pd.IndexSlice to select from your MultiIndex. A mask built with isin is a single vectorized membership test over the level values, which may avoid the per-label lookup work that .loc has to do for a long list of labels on a huge DataFrame.
For example:
res = df.loc[df.index.get_level_values('int_idx').isin(ids)]
Setting the index and sorting the data are each expensive on a frame this size, so avoid doing either more than once. The boolean mask does not need a sorted index, so you can set the index once and skip the sort.
For example:
df.set_index(["int_idx", "date_idx"], inplace=True, drop=False)
res = df[df.index.get_level_values('int_idx').isin(ids)]
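Since the question also mentions filtering by date later, here is a minimal sketch of how the same mask can be combined with a condition on the second level. The toy data, the `value` column, and the cutoff date are made up for illustration; only `int_idx` and `date_idx` come from the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the 800M-row frame.
df = pd.DataFrame({
    "int_idx": [1, 1, 2, 3, 3, 4],
    "date_idx": pd.to_datetime([
        "2020-01-01", "2020-01-02", "2020-01-01",
        "2020-01-01", "2020-01-03", "2020-01-02",
    ]),
    "value": [10, 11, 20, 30, 31, 40],  # made-up payload column
}).set_index(["int_idx", "date_idx"])

ids = [1, 3]

# Boolean mask on the first index level; the index does not need to be sorted.
int_level = df.index.get_level_values("int_idx")
mask = int_level.isin(ids)

# Optional second condition on the date level, combined with &.
date_level = df.index.get_level_values("date_idx")
mask_with_date = mask & (date_level >= pd.Timestamp("2020-01-02"))

res = df[mask]                  # all dates for the requested ints
res_dated = df[mask_with_date]  # requested ints, restricted by date
```

Both masks are plain boolean arrays, so further conditions can be chained in the same way before the frame is indexed once.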
If the DataFrame is too big to hold in memory, consider splitting it into smaller parts, filtering each part separately, and then concatenating the results. You might also use parallel processing to speed up the query. Both of these tactics work well with the Dask library.
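As a rough illustration of the split-filter-concatenate idea, the sketch below uses plain pandas rather than Dask, and assumes the data can be read from a CSV source with `int_idx` and `date_idx` columns. The helper name `filter_csv_in_chunks`, the `value` column, and the CSV format are assumptions for the example, not something stated in the question.

```python
import io
import pandas as pd

def filter_csv_in_chunks(source, ids, chunk_size=1_000_000):
    """Hypothetical helper: read `source` in chunks and keep matching rows.

    Assumes the source has at least one data row.
    """
    wanted = set(ids)
    parts = []
    # With chunksize set, read_csv yields DataFrames of at most chunk_size rows.
    for chunk in pd.read_csv(source, parse_dates=["date_idx"], chunksize=chunk_size):
        parts.append(chunk[chunk["int_idx"].isin(wanted)])
    # Only the (much smaller) filtered result is indexed.
    result = pd.concat(parts, ignore_index=True)
    return result.set_index(["int_idx", "date_idx"])

# Tiny in-memory stand-in for a large file.
csv_text = (
    "int_idx,date_idx,value\n"
    "1,2020-01-01,10\n"
    "2,2020-01-01,20\n"
    "3,2020-01-02,30\n"
    "4,2020-01-02,40\n"
)
demo = filter_csv_in_chunks(io.StringIO(csv_text), ids=[1, 3], chunk_size=2)
```

Only one chunk plus the accumulated matches is in memory at a time, and the index is built on the filtered rows rather than on the full data.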
In response to your second question, a sorted MultiIndex should generally be quicker to query than an unsorted one, because Pandas can locate labels with fast search methods instead of scanning the whole index. However, sorting is itself expensive on a huge DataFrame, since every column has to be reordered, and more so if the frame has many columns or the sort order is complicated. Generally speaking, sort only when the lookups that follow will benefit from it, and avoid it otherwise.
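One way to avoid a redundant sort is to check whether the index is already in order first. The sketch below uses the `is_monotonic_increasing` property of the index; the helper name `ensure_sorted` and the toy data are hypothetical.

```python
import pandas as pd

def ensure_sorted(df):
    """Hypothetical helper: sort by index only if it is not already sorted."""
    if df.index.is_monotonic_increasing:
        return df  # already in order, skip the expensive sort
    return df.sort_index()

# Toy frame with an unsorted two-level index (made-up values).
unsorted = pd.DataFrame(
    {"value": [30, 10, 20]},
    index=pd.MultiIndex.from_tuples(
        [(3, "2020-01-01"), (1, "2020-01-02"), (2, "2020-01-01")],
        names=["int_idx", "date_idx"],
    ),
)

sorted_df = ensure_sorted(unsorted)   # sorts, because the input is unordered
same_df = ensure_sorted(sorted_df)    # returns the frame unchanged
res = sorted_df.loc[[1, 3]]           # label lookup on the first level
```

The check is cheap compared with the sort, so it is a reasonable guard when the same frame may pass through this step more than once.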