
Searching a large DataFrame with a MultiIndex is slow


I have a large Pandas DataFrame (~800M rows) indexed with a two-level MultiIndex: an int and a date. I want to retrieve a subset of the DataFrame's rows based on a list of ints (about 10k) that I have. The ints match the first level of the MultiIndex. The MultiIndex is unique.

The first thing I tried was to sort the index and then query it using loc:

df = get_my_df()  # 800M rows
ids = [...]       # 10k ints, sorted list

df.set_index(["int_idx", "date_idx"], inplace=True, drop=False)
df.sort_index(inplace=True)

idx = pd.IndexSlice
res = df.loc[idx[ids, :]]

However this is painfully slow, and I stopped running the code after about an hour.

The next thing I tried was to set only the first column as the index. This is suboptimal for me because the index is not unique, and later I'll also need to filter further by date:

df.set_index("int_idx", inplace=True, drop=False)
df.sort_index(inplace=True)

idx = pd.IndexSlice
res = df.loc[idx[ids, :]]

To my surprise this was an improvement, but still very slow.

I have two questions:

  1. How can I make my query faster? (Either using single index or multi-index)
  2. Why is a sorted multi-index still so slow?

Solution

  • Retrieving a subset of a DataFrame with 800M rows can be difficult. Here are some ideas to make the search faster:

    1. Use .loc with boolean indexing instead of pd.IndexSlice:

    Build a boolean mask with isin() and pass it to .loc, rather than slicing the MultiIndex with pd.IndexSlice. Slicing by a 10k-element list forces pandas to resolve each key and construct a new index object for the result, which is expensive on a huge DataFrame; a single vectorized isin() pass over the index level avoids that cost.

    For example:

    res = df.loc[df.index.get_level_values('int_idx').isin(ids)]
    
    2. Avoid setting the index multiple times:

    Setting the index and sorting the data repeatedly is expensive. Set the index once if you can, and skip the sort entirely: the boolean-mask approach above does not require a sorted index.

    For example:

    df.set_index(["int_idx", "date_idx"], inplace=True, drop=False)
    res = df[df.index.get_level_values('int_idx').isin(ids)]
    
    3. Use chunking or parallel processing:

    If the DataFrame is too big to fit in memory, consider splitting it into smaller chunks, filtering each chunk separately, and concatenating the results. You can also use parallel processing to speed up the query. The Dask library handles both strategies well, as sketched below.
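
    For example, here is a minimal sketch of the Dask approach. It assumes the data lives in Parquet files and that int_idx is still a regular column rather than an index; the file path is illustrative.

    import dask.dataframe as dd

    # Lazily load the data in partitions (file path and format are assumptions).
    ddf = dd.read_parquet("my_data/*.parquet")

    # Each partition runs the vectorized membership test independently, in
    # parallel; compute() concatenates the surviving rows into a pandas DataFrame.
    res = ddf[ddf["int_idx"].isin(ids)].compute()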

    As for your second question: a sorted multi-index should indeed be faster than an unsorted one, because it lets pandas use NumPy's fast binary-search routines. The catch is producing that order in the first place: sorting a huge DataFrame, especially one with many columns or a complicated sort key, is expensive. Generally speaking, sorting a DataFrame is a costly operation that is best avoided when the access pattern doesn't require it.
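
    As a small synthetic illustration of that trade-off (the data and sizes here are made up): once the index is sorted, label-based range slicing locates its endpoints by binary search, but the sort itself costs O(n log n) up front.

    import numpy as np
    import pandas as pd

    # Synthetic frame with a shuffled integer index.
    df = pd.DataFrame({"val": np.arange(1_000_000)},
                      index=np.random.permutation(1_000_000))

    df = df.sort_index()                      # one-time O(n log n) cost
    assert df.index.is_monotonic_increasing   # enables the fast search path

    res = df.loc[100:200]                     # endpoints found by binary search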