My dataframe has multiple columns containing information about certain products, grouped by brand. Every product has some attributes, represented in column 1, column 2 and so forth. Each of these columns captures additional information in the subcolumns min, max and all. The grouping is done by making brand and model a MultiIndex of the dataframe:
                   column 1     column 2     column 3
                   min max all  min max all  min max all
brand     model
brand(1)  model(1)
          model(2)
          model(3)
brand(2)  model(1)
          model(2)
          model(3)
...
brand(n)  ...
I need to add a new column to this dataframe, called column 4, which needs to have the same subcolumns min, max and all, and must be the length of the dataframe.
The following code creates a dataframe of the length of the original dataframe, with the column new_column and its respective subcolumns min, max and all:
columns_to_add = pd.DataFrame(index=range(len(original_df.index)),
                              columns=[["new_column", "new_column", "new_column"],
                                       ["min", "max", "all"]])
original_df = original_df.append(columns_to_add)
However, when I visualise my original dataframe, it seems I lose the grouping I initially had and get something like this:
                      column 1     column 2     column 3     new_column
                      min max all  min max all  min max all  min max all
(brand(1), model(1))
(brand(1), model(2))
(brand(1), model(3))
(brand(2), model(1))
(brand(2), model(2))
(brand(2), model(3))
...
(brand(n), model(i))
...
1410                  NaN NaN NaN  NaN NaN NaN  NaN NaN NaN  NaN NaN NaN
1411                  NaN NaN NaN  NaN NaN NaN  NaN NaN NaN  NaN NaN NaN
1412                  NaN NaN NaN  NaN NaN NaN  NaN NaN NaN  NaN NaN NaN
I don't understand where these extra rows starting from index 1410 come from. I believe the loss of the grouping caused that, and these rows were consequently filled with NaN. How could I add new_column in such a way as to retain the grouping?
[1] I just realised that my original dataframe has 1440 rows, while the new dataframe containing the empty rows is 2880 rows long, in other words double the length of the original dataframe. What caused the doubling of the rows?
[2] How could I perform a transform() on a multi-index dataframe in order to keep the groupby? This would prevent doubling the amount of rows of my dataframe.
To answer your main question, adding a new multi-level column with subcolumns can be done with MultiIndex.from_product
. Some test data to illustrate the process:
import numpy as np
import pandas as pd

df = pd.DataFrame({'brand': [1, 1, 1, 1, 2, 2, 2, 2],
                   'model': [3, 3, 4, 4, 5, 5, 5, 6],
                   'col1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'col2': [9, 8, 7, 6, 5, 4, 3, 2]})
df = df.groupby(['brand', 'model']).agg({'col1': ['min', 'max', 'mean'],
                                         'col2': ['min', 'max', 'mean']})
which gives:
            col1          col2
             min max mean  min max mean
brand model
1     3        1   2  1.5    8   9  8.5
      4        3   4  3.5    6   7  6.5
2     5        5   7  6.0    3   5  4.0
      6        8   8  8.0    2   2  2.0
Adding the new multi-level column:
df = df.join(pd.DataFrame(np.random.rand(len(df),3),
columns=pd.MultiIndex.from_product([['new_column'], ['min','max','mean']]),
index=df.index))
Here np.random.rand(len(df), 3) is used to fill in the values, but anything can be used as long as the dimensions are correct. Supplying no data at all sets every value in the new columns to NaN.
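If the goal, as in the question, is to add the subcolumns empty first and fill them later, the same join works with no data argument at all. A minimal sketch, rebuilding the grouped test frame from above:

```python
import pandas as pd

# Rebuild the grouped test frame from above
df = pd.DataFrame({'brand': [1, 1, 1, 1, 2, 2, 2, 2],
                   'model': [3, 3, 4, 4, 5, 5, 5, 6],
                   'col1': [1, 2, 3, 4, 5, 6, 7, 8],
                   'col2': [9, 8, 7, 6, 5, 4, 3, 2]})
df = df.groupby(['brand', 'model']).agg({'col1': ['min', 'max', 'mean'],
                                         'col2': ['min', 'max', 'mean']})

# No data argument: every cell of the new subcolumns starts as NaN,
# and reusing df.index keeps the (brand, model) MultiIndex intact
empty_cols = pd.DataFrame(
    columns=pd.MultiIndex.from_product([['new_column'], ['min', 'max', 'mean']]),
    index=df.index)
df = df.join(empty_cols)
```

The row count is unchanged, so there is no doubling and no lost grouping, and the subcolumns can then be filled in place, e.g. df[('new_column', 'min')] = some_series.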
Result:
            col1          col2          new_column
             min max mean  min max mean        min       max      mean
brand model
1     3        1   2  1.5    8   9  8.5   0.065094  0.489666  0.476452
      4        3   4  3.5    6   7  6.5   0.280267  0.237083  0.272776
2     5        5   7  6.0    3   5  4.0   0.650988  0.384788  0.486176
      6        8   8  8.0    2   2  2.0   0.025630  0.908280  0.386871
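As for question [1], the doubling comes from the fact that append (replaced by pd.concat in current pandas) stacks rows rather than adding columns: columns_to_add has a plain RangeIndex while original_df has a (brand, model) MultiIndex, so no rows align, and all 1440 new rows land below the original 1440, giving 2880. A minimal sketch of the effect with small illustrative frames (the names original and extra are hypothetical):

```python
import pandas as pd

# A 3-row frame with a MultiIndex, standing in for original_df
original = pd.DataFrame({'col1': [1, 2, 3]},
                        index=pd.MultiIndex.from_tuples(
                            [('a', 1), ('a', 2), ('b', 1)],
                            names=['brand', 'model']))

# A 3-row frame with a plain RangeIndex, standing in for columns_to_add
extra = pd.DataFrame(columns=['new_column'], index=range(3))

# Row-wise concatenation: the indexes do not match, so the rows are
# stacked, doubling the length and filling non-overlapping cells with NaN
stacked = pd.concat([original, extra])
print(len(stacked))  # 6, not 3 -- the same doubling seen in the question
```

Joining on the columns axis, as shown in the answer above, avoids this because it aligns on the existing index instead of stacking rows.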