Search code examples
pythonpandasdataframemulti-index

Transforming a multi-index dataframe to retain a groupby() aggregation


Issue:

My dataframe has multiple columns containing the information of certain products grouped by brands. Every product has some attributes represented in column 1, column 2 and so forth. These columns each capture additional information in the subcolumns min,max and all. The way the grouping is done, is that the brand and model form a multiindex of the dataframe

                          column 1       column 2      column 3      
brand       model         min max all    min max all   min max all    min max all

brand(1)    model(1)
            model(2)
            model(3)

brand(2)    model(1)    
  .         model(2) 
  .         model(3)
  .           .
brand(n)    

I need to add a new column to this dataframe called column 4 which needs to possess the same subcolumns min, max and all and must be the length of the dataframe.

Tried:

The following code creates a dataframe of the length of the original dataframe with the column new_column and its respective subcolumns min, max and all.

columns_to_add = pd.DataFrame(index= range(len(original_df.index)), columns =  ["new_column","new_column","new_column"],["min","max","all"]])
original_df = original_df.append(columns_to_add)

However when I visualise my original dataframe it seems i lose the grouping I initially had and get something like this.

                          column 1      column 2       column 3     new_column
                          min max all   min max all  min max all    min max all

(brand(1),model(1))
(brand(1),model(2))
(brand(1),model(3))

(brand(2),model(1))    
(brand(2),model(2))
(brand(2),model(3))
         .          
         .          
(brand(n),model(i))
         .           
 1410                     NaN NaN NaN   NaN NaN NaN   NaN NaN NaN   
 1411                     NaN NaN NaN   NaN NaN NaN   NaN NaN NaN                          
 1412                     NaN NaN NaN   NaN NaN NaN   NaN NaN NaN     

I don't understand where these extra rows starting from 1410 come from. I believe the ungrouping caused that, and it thus filled these rows with NaN.

Questions:

  1. Is there a way for me to add the new_column in such a way to retain the grouping?
  2. What can I do about these new rows filled with NaN?

Edit:

[1] I just realised that my original dataframe has 1440 rows while the new dataframe containing the empty rows is 2880 rows long, in other words double the length of the original dataframe. What caused the doubling of the rows?

[2] How could I perform a transform() on a multi-index dataframe in order to to keep the groupby? This would prevent doubling the amount of rows of my dataframe


Solution

  • To answer your main question, adding a new multi-level column with subcolumns can be done with MultiIndex.from_product. Some test data to illustrate the process:

    df = pd.DataFrame({'brand': [1,1,1,1,2,2,2,2], 'model': [3,3,4,4,5,5,5,6], 'col1': [1,2,3,4,5,6,7,8], 'col2': [9,8,7,6,5,4,3,2]})
    df = df.groupby(['brand', 'model']).agg({'col1': ['min', 'max', 'mean'], 'col2': ['min', 'max', 'mean']})
    

    which gives:

                    col1          col2
                min max mean  min max mean
    brand model                     
    1     3     1   2   1.5   8   9   8.5  
          4     3   4   3.5   6   7   6.5
    2     5     5   7   6.0   3   5   4.0
          6     8   8   8.0   2   2   2.0
    

    Adding the new multi-level column:

    df = df.join(pd.DataFrame(np.random.rand(len(df),3),
                              columns=pd.MultiIndex.from_product([['new_column'], ['min','max','mean']]),
                              index=df.index))
    

    Here np.random.rand(len(df),3) is used to fill in the values, but anything can be used as long as the dimensions are correct. Using nothing here will set the values in the new columns as NaN.

    Result:

                    col1          col2            new_column
                min max mean  min max mean  min      max      mean
    brand model                     
    1     3     1   2   1.5   8   9   8.5   0.065094 0.489666 0.476452
          4     3   4   3.5   6   7   6.5   0.280267 0.237083 0.272776
    2     5     5   7   6.0   3   5   4.0   0.650988 0.384788 0.486176
          6     8   8   8.0   2   2   2.0   0.025630 0.908280 0.386871