Search code examples
pythonpandasdataframemulti-index

Dataframe : Group separate subcolumns with identical column names together


I have managed to formulate a dataframe having ~200 columns each having a subcolumn. Now after some calculations, a separate set of values have been stored in the same dataframe with the same column names. Hence they are stored with a different subcolumn name. My query is how to group these two sets of (column, subcolumn) pairs together.

The following is a smaller subset of a similar representation (What I have):

          Date    Column1     Column2   Column3      Column1     Column2   Column3
                      Num         Num       Num       ProdID      ProdID    ProdID
0   2015-01-02        100         106       111           AA          AA        AD
1   2015-01-02        105         118       109           AB          AA        AB
2   2015-01-02        102         173       188           AC          AC        AC
3   2015-01-02        112         101       105           AD          AB        AE
4   2015-01-02        139         191       156           AE          AE        AC
5   2015-01-02        126         172       117           AF          AE        AF

The following is what I want :

          Date              Column1              Column2                Column3      
                      Num    ProdID       Num     ProdID         Num     ProdID      
0   2015-01-02        100        AA       106         AA         111         AD
1   2015-01-02        105        AB       118         AA         109         AB
2   2015-01-02        102        AC       173         AC         188         AC
3   2015-01-02        112        AD       101         AB         105         AE
4   2015-01-02        139        AE       191         AE         156         AC
5   2015-01-02        126        AF       172         AE         117         AF

I have tried the basic groupby statements like

df = df.groupby(df.columns, axis=1)

but that doesn't work. I think it's something easy but can't figure it out


Solution

  • We can try with reindex:

    df = df.reindex(columns=[df.columns[0], *sorted(df.columns[1:])])
    

    df.columns[0] is Date which should not be sorted. Then the rest of the columns get sorted df.columns[1:].


    Sample Data

    import numpy as np
    import pandas as pd
    
    rows = 3
    df = pd.DataFrame(
        np.arange(1, (rows * 5) + 1).reshape(-1, 5),
        columns=pd.MultiIndex.from_arrays([
            ['Date', 'Column1', 'Column2', 'Column1', 'Column2'],
            ['', 'Num', 'Num', 'ProdID', 'ProdID']
        ])
    )
    df.iloc[:, 0] = pd.date_range(start='2021-01-01', periods=len(df), freq='D')
    

    Before:

            Date Column1 Column2 Column1 Column2
                     Num     Num  ProdID  ProdID
    0 2021-01-01       2       3       4       5
    1 2021-01-02       7       8       9      10
    2 2021-01-03      12      13      14      15
    

    After reindex:

            Date Column1        Column2       
                     Num ProdID     Num ProdID
    0 2021-01-01       2      4       3      5
    1 2021-01-02       7      9       8     10
    2 2021-01-03      12     14      13     15