I have a pandas dataframe that has monthly counts at various hierarchical levels. It is in long format, and I want to convert to wide format, with columns for each level of aggregation.
It is of the following format:
date | country | state | county | population
01-01| cc1 | s1 | c1 | 5
01-01| cc1 | s1 | c2 | 4
01-01| cc1 | s2 | c1 | 10
01-01| cc1 | s2 | c2 | 11
02-01| cc1 | s1 | c1 | 6
02-01| cc1 | s1 | c2 | 5
02-01| cc1 | s2 | c1 | 11
02-01| cc1 | s2 | c2 | 12
.
.
Now I want to transform this into the following format:
date | country_pop| s1_pop | s2_pop| .. | s1_c1_pop | s1_c2_pop| s2_c1_pop | s2_c2_pop|..
01-01| 30 | 9 | 21 | ...| 5 | 4 | 10 | 11 |..
02-01| 34 | 11 | 23 | ...| 6 | 5 | 11 | 12 |..
.
.
The total number of states is, 4, s1....s4.
And the counties in each state can be labelled c1.... c10 (some states might have less, and I want those columns to be zeros.)
I want to get a time series at each level of aggregation, ordered by the date. How do I get this ?
Let's do it this way using sum with the level parameter and pd.concat all the dataframes together.
#Aggregate to lowest level of detail
df_agg = df.groupby(['country', 'date', 'state', 'county'])[['population']].sum()
#Reshape dataframe and flatten multiindex column header
df_county = df_agg.unstack([-1, -2])
df_county.columns = [f'{s}_{c}_{p}' for p, c, s in df_county.columns]
#Sum to next level of detail and reshape
df_state = df_agg.sum(level=[0, 1, 2]).unstack()
df_state.columns = [f'{s}_{p}' for p, s in df_state.columns]
#Sum to country level
df_country = df_agg.sum(level=[0, 1])
#pd.concat horizontally with axis=1
df_out = pd.concat([df_country, df_state, df_county], axis=1).reset_index()
Output:
country date population s1_population s2_population s1_c1_population \
0 cc1 01-01 30 9 21 5
1 cc1 02-01 34 11 23 6
s1_c2_population s2_c1_population s2_c2_population
0 4 10 11
1 5 11 12