Search code examples
python-3.xpandaspandas-groupbyhierarchical-datamulti-index

pandas comma separated hierarchy groupby sum


I have the following data format of hierarchical data. There can be multiple rows of a single level and variable depth. I am trying to get the a result where in col_2 we see the aggregated sum of all the inferior levels of the instance.

Using a simple groupby does not work as it does not understand the hierarchical structure. I have tried splitting col_1 into multiple columns named level-1 to level-6 (depth), and consequently groupby level-1 to level-6, but the results are still incorrect in spite of dataframe being multi-index.

code before separation:
df.groupby(["col_1"], as_index=False).sum()

code after separation:
df.groupby(["level-1","level-2","level-3","level-4","level-5","level-6"], as_index=False).sum()

Any help would be appreciated!

UPDATE thanks to @Yo_Chris so far:

import pandas as pd
# sample data
df = pd.DataFrame({'Col1': ['PUU', 'PUU;UT', 'PUU;UT', 'PUU;UT;AHU', 'PUU;UT;AHU;CSP', 'PUU;AS', 'PUU;PREV', 'PUU;TECHNOLOGY', 'PUU;TECHNOLOGY', 'PUU;TECHNOLOGY;SPEC'],
                  'Col2': [1000,1000,50,500,250,100,1000,300,500,900]})

# groupby, sum and invert 
s = df.groupby('Col1')['Col2'].sum()[::-1]
# groupby, cumsum and invert
s.groupby(s.index.str[0]).cumsum()[::-1])```

# this results in the following:

Col1
PUU                    5600
PUU;AS                 4600
PUU;PREV               4500
PUU;TECHNOLOGY         3500
PUU;TECHNOLOGY;SPEC    2700
PUU;UT                 1800
PUU;UT;AHU              750
PUU;UT;AHU;CSP          250
Name: Col2, dtype: int64

While what we want is:

PUU                    5600
PUU;AS                  100
PUU;PREV               1000
PUU;TECHNOLOGY         1700
PUU;TECHNOLOGY;SPEC     900
PUU;UT                 1800
PUU;UT;AHU              750
PUU;UT;AHU;CSP          250

Solution

  • Solved this eventually by splitting col_1 to separate columns per depth. then group by each column (depth 1, 2,..6) and concatenation of all resulting dataframes. Not very clean but it works correctly!