Search code examples
pythonpandasdataframegroup-byaggregate

Grouping pandas data frame columns according to an external vector


I am trying to calculate row sums of a pandas data frame for columns groupped according to an external vector. For example, for a matrix

matrix = pd.DataFrame(np.array( [[1,2,3,4,5], [10,20,30,40,50],[100,200,300,400,500]  ]  ))
matrix.columns =  ["T1", "T3", "T1", "T1", "T3"]
print(matrix)
    T1   T3   T1   T1   T3
0    1    2    3    4    5
1   10   20   30   40   50
2  100  200  300  400  500

the desired output is:

    T1  T2   T3
0    8   0    7
1   80   0   70
2  800   0  700

Therefore, I would like to group columns by a potentially broader vector that columns names (matrix.columns).

I obviosly tried:

matrix.groupby(matrix.columns, axis=1).sum()
    T1   T3
0    8    7
1   80   70
2  800  700

but this yields a differrent result.

I also tried:

matrix.groupby(["T1", "T2", "T3"], axis=1).sum()

but this results in an error.


Solution

  • because T2 does not exist in matrix.columns one method would be to add it.

    It could either be added as a new zero value column in the begining.

    Alternatively, it could be added at the end (after the groupby is done).

    here is an example of adding it after the groupby is done:

    import pandas as pd
    import numpy as np
    
    matrix = pd.DataFrame(np.array( [[1,2,3,4,5], [10,20,30,40,50],[100,200,300,400,500]  ]  ))
    matrix.columns =  ["T1", "T3", "T1", "T1", "T3"]
    print(matrix)
    
    z = matrix.groupby(matrix.columns, axis=1).sum()
    print(z)
    
    columns_to_add = []
    columns_you_want = ["T1", "T2", "T3"]
    for c in columns_you_want:
        if c not in matrix.columns:
            # print('we want this', c)
            columns_to_add.append(c)
    
    print('columns to add:', columns_to_add)
    
    for c in columns_to_add:
        # Add a new column with default value 0
        print('adding', c)
        z[c] = 0
        
    
    print(z)
    

    the result is this:

        T1   T3   T1   T1   T3
    0    1    2    3    4    5
    1   10   20   30   40   50
    2  100  200  300  400  500
        T1   T3
    0    8    7
    1   80   70
    2  800  700
    columns to add: ['T2']
    adding T2
        T1   T3  T2
    0    8    7   0
    1   80   70   0
    2  800  700   0