Search code examples
pythonpandastime-serieshierarchy

How to get multiple levels of aggregated sums into time series columns from a dataframe


I have a pandas dataframe that has monthly counts at various hierarchical levels. It is in long format, and I want to convert to wide format, with columns for each level of aggregation.

It is of the following format:

date | country | state | county | population 
01-01| cc1     | s1    | c1     | 5
01-01| cc1     | s1    | c2     | 4
01-01| cc1     | s2    | c1     | 10
01-01| cc1     | s2    | c2     | 11
02-01| cc1     | s1    | c1     | 6
02-01| cc1     | s1    | c2     | 5
02-01| cc1     | s2    | c1     | 11
02-01| cc1     | s2    | c2     | 12
.
.

Now I want to transform this into the following format:

date | country_pop| s1_pop | s2_pop| .. | s1_c1_pop | s1_c2_pop| s2_c1_pop | s2_c2_pop|..

01-01| 30         | 9      | 21    | ...| 5         | 4        | 10         | 11        |..
02-01| 34         | 11     | 23    | ...| 6         | 5        | 11         | 12        |..
.
.

The total number of states is, 4, s1....s4.

And the counties in each state can be labelled c1.... c10 (some states might have less, and I want those columns to be zeros.)

I want to get a time series at each level of aggregation, ordered by the date. How do I get this ?


Solution

  • Let's do it this way using sum with the level parameter and pd.concat all the dataframes together.

    #Aggregate to lowest level of detail
    df_agg = df.groupby(['country', 'date', 'state', 'county'])[['population']].sum()
    
    #Reshape dataframe and flatten multiindex column header
    df_county = df_agg.unstack([-1, -2])
    df_county.columns = [f'{s}_{c}_{p}' for p, c, s in df_county.columns]
    
    #Sum to next level of detail and reshape
    df_state = df_agg.sum(level=[0, 1, 2]).unstack()
    df_state.columns = [f'{s}_{p}' for p, s in df_state.columns]
    
    #Sum to country level 
    df_country = df_agg.sum(level=[0, 1])
    
    #pd.concat horizontally with axis=1
    df_out = pd.concat([df_country, df_state, df_county], axis=1).reset_index()
    

    Output:

      country   date  population  s1_population  s2_population  s1_c1_population  \
    0     cc1  01-01          30              9             21                 5   
    1     cc1  02-01          34             11             23                 6   
    
       s1_c2_population  s2_c1_population  s2_c2_population  
    0                 4                10                11  
    1                 5                11                12