I have the following data format of hierarchical data. There can be multiple rows of a single level and variable depth. I am trying to get the a result where in col_2 we see the aggregated sum of all the inferior levels of the instance.
Using a simple groupby does not work as it does not understand the hierarchical structure. I have tried splitting col_1 into multiple columns named level-1 to level-6 (depth), and consequently groupby level-1 to level-6, but the results are still incorrect in spite of dataframe being multi-index.
code before separation:
df.groupby(["col_1"], as_index=False).sum()
code after separation:
df.groupby(["level-1","level-2","level-3","level-4","level-5","level-6"], as_index=False).sum()
Any help would be appreciated!
UPDATE thanks to @Yo_Chris so far:
import pandas as pd
# sample data
df = pd.DataFrame({'Col1': ['PUU', 'PUU;UT', 'PUU;UT', 'PUU;UT;AHU', 'PUU;UT;AHU;CSP', 'PUU;AS', 'PUU;PREV', 'PUU;TECHNOLOGY', 'PUU;TECHNOLOGY', 'PUU;TECHNOLOGY;SPEC'],
'Col2': [1000,1000,50,500,250,100,1000,300,500,900]})
# groupby, sum and invert
s = df.groupby('Col1')['Col2'].sum()[::-1]
# groupby, cumsum and invert
s.groupby(s.index.str[0]).cumsum()[::-1])```
# this results in the following:
Col1
PUU 5600
PUU;AS 4600
PUU;PREV 4500
PUU;TECHNOLOGY 3500
PUU;TECHNOLOGY;SPEC 2700
PUU;UT 1800
PUU;UT;AHU 750
PUU;UT;AHU;CSP 250
Name: Col2, dtype: int64
While what we want is:
PUU 5600
PUU;AS 100
PUU;PREV 1000
PUU;TECHNOLOGY 1700
PUU;TECHNOLOGY;SPEC 900
PUU;UT 1800
PUU;UT;AHU 750
PUU;UT;AHU;CSP 250
Solved this eventually by splitting col_1 to separate columns per depth. then group by each column (depth 1, 2,..6) and concatenation of all resulting dataframes. Not very clean but it works correctly!