Pandas hierarchical sort


I have a dataframe of categories and amounts. Categories can be nested into subcategories to any number of levels using a colon-separated string. I want to sort it by descending amount, but hierarchically, as shown below.

How I need it sorted

CATEGORY                            AMOUNT
Transport                           5000
Transport : Car                     4900
Transport : Train                   100
Household                           1100
Household : Utilities               600
Household : Utilities : Water       400
Household : Utilities : Electric    200
Household : Cleaning                100
Household : Cleaning : Bathroom     75
Household : Cleaning : Kitchen      25
Household : Rent                    400
Living                              250
Living : Other                      150
Living : Food                       100

EDIT: The data frame:

import pandas as pd

df = pd.DataFrame({
    "category": ["Transport", "Transport : Car", "Transport : Train", "Household", "Household : Utilities", "Household : Utilities : Water", "Household : Utilities : Electric", "Household : Cleaning", "Household : Cleaning : Bathroom", "Household : Cleaning : Kitchen", "Household : Rent", "Living", "Living : Other", "Living : Food"],
    "amount": [5000, 4900, 100, 1100, 600, 400, 200, 100, 75, 25, 400, 250, 150, 100]
})

Note: this is the order I want. The rows may be in any arbitrary order before the sort.
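
To make the colon-separated encoding concrete, here is a minimal sketch of how each category string breaks down into one column per nesting level. The names `categories`, `levels`, and `depth` are illustrative only and do not appear in the question.

```python
import pandas as pd

# Three example rows taken from the table above.
categories = pd.Series([
    "Household",
    "Household : Utilities",
    "Household : Utilities : Water",
])

# One column per nesting level; shorter paths are padded with missing values.
levels = categories.str.split(":", expand=True).apply(lambda col: col.str.strip())

# Depth of each row = number of non-missing levels.
depth = levels.notna().sum(axis=1)

print(levels)
print(depth.tolist())
```

Any hierarchical sort has to compare rows level by level on a structure like this, rather than on the raw strings.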

EDIT2: If anyone is looking for a similar solution, I posted the one I settled on here: How to sort dataframe in pandas by value in hierarchical category structure


Solution

  • To answer my own question: I found a way. It is a bit long-winded, but here it is.

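    import numpy as np
    import pandas as pd


    def sort_tree_df(df, tree_column, sort_column):
        # Work on a copy so the caller's frame keeps its columns and index.
        df = df.copy()
        sort_key = sort_column + '_abs'
        df[sort_key] = df[sort_column].abs()
        # Split "A : B : C" into one index level per depth; shorter paths are padded with NaN.
        df.index = pd.MultiIndex.from_frame(
            df[tree_column].str.split(":").apply(lambda x: [y.strip() for y in x]).apply(pd.Series))
        # np.lexsort treats the LAST key as the primary one, so the keys run from
        # least significant (category name, own amount) to most significant
        # (largest amount within each top-level category).
        sort_columns = [df[tree_column].values, df[sort_key].values] + [
            df.groupby(level=list(range(0, x)))[sort_key].transform('max').values
            for x in range(df.index.nlevels - 1, 0, -1)
        ]
        sort_indexes = np.lexsort(sort_columns)
        # Reverse for descending order, then drop the helper index and column.
        df_sorted = df.iloc[sort_indexes[::-1]]
        return df_sorted.reset_index(drop=True).drop(columns=sort_key)


    sort_tree_df(df, 'category', 'amount')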
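
For comparison, here is a sketch of a different technique from the answer above: it uses a plain Python sort key instead of `np.lexsort`. Each row's key is a tuple with one `(negated amount, name)` pair per ancestor, root first. A parent's key is a prefix of its children's keys, so the parent sorts directly before them, and siblings come out in descending amount. The function name `sort_by_ancestor_amounts` and its `sep` parameter are hypothetical. The sketch assumes that every category path is unique and that every parent category has its own row; a missing parent is treated as amount 0.

```python
import pandas as pd


def sort_by_ancestor_amounts(df, tree_column, sort_column, sep=":"):
    """Sort rows so parents precede children and siblings are in descending amount."""
    # Normalise "A : B" into the tuple ("A", "B") for every row.
    paths = [tuple(part.strip() for part in s.split(sep)) for s in df[tree_column]]
    # Absolute amount of each path's own row (assumes paths are unique).
    amount_of = dict(zip(paths, df[sort_column].abs()))

    def key(path):
        # One (negated amount, name) pair per ancestor, root first.
        # Missing ancestors fall back to 0 (an assumption of this sketch).
        return tuple(
            (-amount_of.get(path[:depth], 0), path[depth - 1])
            for depth in range(1, len(path) + 1)
        )

    # Sort positional indices, then reorder the frame by position.
    order = sorted(range(len(df)), key=lambda i: key(paths[i]))
    return df.iloc[order].reset_index(drop=True)


example = pd.DataFrame({
    "category": ["Living : Food", "Transport", "Living", "Transport : Car", "Living : Other"],
    "amount": [100, 5000, 250, 4900, 150],
})
print(sort_by_ancestor_amounts(example, "category", "amount"))
```

Under a strictly descending rule, this places Household : Rent (400) ahead of Household : Cleaning (100), which differs from the ordering listed in the question.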