Search code examples
pandasmulti-index

MultiIndex isn't kept when pd.concating multiple subtotal rows


I lose my multiIndex when I try to pd.concat a second subtotal. I'm able to add the first subtotal but not the second which is a sum of B0.

This is how my current df is:

lvl0                    a         b     
lvl1                  bar  foo  bah  foo
A0 B0 C0 D0             1    0    3    2
         D1             5    4    7    6
         First Total    6    4   10    8
      C1 D0             9    8   11   10
         D1            13   12   15   14
         First Total   22   20   26   24
      C2 D0            17   16   19   18

After trying to add the second subtotal I get this:

lvl0                            a           b      
lvl1                          bar   foo   bah   foo
(A0, B0, C2, First Total)    38    36    42    40
(A0, B0, C3, D0)             25    24    27    26
(A0, B0, C3, D1)             29    28    31    30
(A0, B0, C3, First Total)    54    52    58    56
(A0, B0, Second Total)      120   112   136   128
(A0, B1, C0, D0)             33    32    35    34
(A0, B1, C0, D1)             37    36    39    38
(A0, B1, C0, First Total)    70    68    74    72
(A0, B1, C1, D0)             41    40    43    42

You should be able to copy and paste the code below to test

import pandas as pd
import numpy as np

# creating multiIndex
def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]


miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                        mklbl('B', 2),
                                        mklbl('C', 4),
                                        mklbl('D', 2)])


micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                        ('b', 'foo'), ('b', 'bah')],
                                      names=['lvl0', 'lvl1'])


dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                     .reshape((len(miindex), len(micolumns))),
                    index=miindex,
                   columns=micolumns).sort_index().sort_index(axis=1)

# My code STARTS HERE
# creating the first subtotal
print(dfmi.index)
df1 = dfmi.groupby(level=[0,1,2]).sum()
df2 = dfmi.groupby(level=[0, 1]).sum()

df1 = df1.set_index(np.array(['First Total'] * len(df1)), append=True)
dfmi = pd.concat([dfmi, df1]).sort_index(level=[0, 1])

print(dfmi)

# this is where the multiIndex is lost
df2 = df2.set_index(np.array(['Second Total'] * len(df2)), append=True)
dfmi = pd.concat([dfmi, df2]).sort_index(level=[1])

print(dfmi)

How I would want it to look:

lvl0                    a         b     
lvl1                  bar  foo  bah  foo
A0 B0 C0 D0             1    0    3    2
         D1             5    4    7    6
         First Total    6    4   10    8
      C1 D0             9    8   11   10
         D1            13   12   15   14
         First Total   22   20   26   24
      C2 D0            17   16   19   18
         D1            21   20   23   22
         First Total   38   36   42   40
      C3 D0            25   24   27   26
         D1            29   28   31   30
         First Total   54   52   58   56
      Second Total     120  112  136  128
   B1 C0 D0            33   32   35   34
         D1            37   36   39   38
         First Total   70   68   74   72
      C1 D0            41   40   43   42
         D1            45   44   47   46
         First Total   86   84   90   88
      C2 D0            49   48   51   50
         D1            53   52   55   54
         First Total  102  100  106  104
      C3 D0            57   56   59   58
         D1            61   60   63   62
         First Total  118  116  122  120
       Second Total   376  368  392  384

the first total is sum of level 2, the second total is sum of level 1


Solution

  • dfmi has a 4-level MultiIndex:

    In [208]: dfmi.index.nlevels
    Out[208]: 4
    

    df2 has a 3-level MultiIndex. Instead, if you use

    df2 = df2.set_index([np.array(['Second Total'] * len(df2)), [''] * len(df2)], append=True)
    

    then df2 ends up with a 4-level MultiIndex. When dfmi and df2 have the same number of levels, then pd.concat([dfmi, df2]) produces the desired result.


    One problem you may face when sorting by index labels is that it relies on the strings 'First' and 'Second' appearing last in alphabetic order. An alterative to sorting by index would be assigning a numeric order column and sorting by that instead:

    dfmi['order'] = range(len(dfmi))
    df1['order'] = dfmi.groupby(level=[0,1,2])['order'].last() + 0.1
    df2['order'] = dfmi.groupby(level=[0,1])['order'].last() + 0.2
    ...
    dfmi = pd.concat([dfmi, df1, df2])        
    dfmi = dfmi.sort_values(by='order')
    

    Incorporating Scott Boston's improvement, the code would then look like this:

    import pandas as pd
    import numpy as np
    
    def mklbl(prefix, n):
        return ["%s%s" % (prefix, i) for i in range(n)]
    
    
    miindex = pd.MultiIndex.from_product([mklbl('A', 4),
                                            mklbl('B', 2),
                                            mklbl('C', 4),
                                            mklbl('Z', 2)])
    
    
    micolumns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'),
                                            ('b', 'foo'), ('b', 'bah')],
                                          names=['lvl0', 'lvl1'])
    
    
    dfmi = pd.DataFrame(np.arange(len(miindex) * len(micolumns))
                         .reshape((len(miindex), len(micolumns))),
                        index=miindex,
                       columns=micolumns).sort_index().sort_index(axis=1)
    
    df1 = dfmi.groupby(level=[0,1,2]).sum()
    df2 = dfmi.groupby(level=[0, 1]).sum()
    
    dfmi['order'] = range(len(dfmi))
    df1['order'] = dfmi.groupby(level=[0,1,2])['order'].last() + 0.1
    df2['order'] = dfmi.groupby(level=[0,1])['order'].last() + 0.2
    
    df1 = df1.assign(lev4='First').set_index('lev4', append=True)
    df2 = df2.assign(lev3='Second', lev4='').set_index(['lev3','lev4'], append=True)
    dfmi = pd.concat([dfmi, df1, df2])
    dfmi = dfmi.sort_values(by='order')
    dfmi = dfmi.drop(['order'], axis=1)
    print(dfmi)
    

    which yields

    lvl0                  a         b     
    lvl1                bar  foo  bah  foo
    A0 B0 C0     Z0       1    0    3    2
                 Z1       5    4    7    6
                 First    6    4   10    8
          C1     Z0       9    8   11   10
                 Z1      13   12   15   14
                 First   22   20   26   24
          C2     Z0      17   16   19   18
                 Z1      21   20   23   22
                 First   38   36   42   40
          C3     Z0      25   24   27   26
                 Z1      29   28   31   30
                 First   54   52   58   56
          Second        120  112  136  128
    ...