Search code examples
sortingpandasmulti-index

Pandas multiindex sort


In Pandas 0.19 I have a large dataframe with a Multiindex of the following form

          C0     C1     C2
A   B
bar one   4      2      4
    two   1      3      2
foo one   9      7      1
    two   2      1      3

I want to sort bar and foo (and many more double lines as them) according to "two" to get the following:

          C0     C1     C2
A   B
bar one   4      4      2
    two   1      2      3
foo one   7      9      1
    two   1      2      3

I am interested in speed (as I have many columns and many pairs of rows). I am also happy with re-arranging the data if it speeds up the sorting. Many thanks


Solution

  • Here is a mostly numpy solution that should yield good performance. It first selects only the 'two' rows and argsorts them. It then sets this order for each row of the original dataframe. It then unravels this order (after adding a constant to offset each row) and the original dataframe values. It then reorders all the original values based on this unraveled, offset and argsorted array before creating a new dataframe with the intended sort order.

    rows, cols = df.shape
    df_a = np.argsort(df.xs('two', level=1))
    order = df_a.reindex(df.index.droplevel(-1)).values
    offset = np.arange(len(df)) * cols
    order_final = order + offset[:, np.newaxis]
    pd.DataFrame(df.values.ravel()[order_final.ravel()].reshape(rows, cols), index=df.index, columns=df.columns)
    

    Output

             C0  C1  C2
    A   B              
    bar one   4   4   2
        two   1   2   3
    foo one   7   9   1
        two   1   2   3
    

    Some Speed tests

    # create much larger frame
    import string
    idx = pd.MultiIndex.from_product((list(string.ascii_letters), list(string.ascii_letters) + ['two']))
    df1 = pd.DataFrame(index=idx, data=np.random.rand(len(idx), 3), columns=['C0', 'C1', 'C2'])
    
    #scott boston
    %timeit df1.groupby(level=0).apply(sortit)
    10 loops, best of 3: 199 ms per loop
    
    #Ted
    1000 loops, best of 3: 5 ms per loop