Search code examples
pythonpandasmeancalculated-columns

How to groupby part of a column name, and aggregate mean?


I can take an average of two columns, and return it as a new column, but I can't figure out how to do it for all the columns in the specific convention that raw data has. The goal is to average A1 & D1, ..., A12 & D12 etc..

I tried different iterations of what's below but it doesn't work if I don't specify exact columns...

import pandas as pd
df = pd.DataFrame({'Time': [0.1, 0.2, 0.3], 
                   'A1': [1000, 2000, 3000],
                   'A12': [1000, 2000, 3000],
                   'B1': [4000, 5000, 6000],
                   'B12': [4000, 5000, 6000],
                   'D1': [40000, 50000, 42000],
                   'D12': [40000, 50000, 42000]})

for column in d.columns:
    if column.startswith('A') and column.startswith('D'):
        d= (d[['A1', 'D1']].iloc[:].mean(axis=1))

Solution

    • As noted in the comments, column.startswith('A') and column.startswith('D') will never be true.
    • A solution in the comments was close, but the slicing was not in the correct location, and it was the mean of all the columns, not just A and D.
    • [1:] begins the slice after the first character (e.g. A1[1:] is 1)
    • Real Python: String Indexing
    • Create a Boolean mask with pandas.Series.str.contains to find only the columns of interest, A and D, in this case.
      • .contains is used, because it recognizes a regex pattern, and .startswith does not.
    • Perform the .groupby only on the desired columns
      • df[cols].groupby(df[cols].columns.str[1:], axis=1)
    # select columns with A or D
    cols = df.columns[df.columns.str.contains('A|D')]
    
    dfm = df.join(df[cols].groupby(df[cols].columns.str[1:], axis=1).mean().astype(float).add_prefix('AD_mean_'))
    
       Time    A1   A12    B1   B12     D1    D12  AD_mean_1  AD_mean_12
    0   0.1  1000  1000  4000  4000  40000  40000    20500.0     20500.0
    1   0.2  2000  2000  5000  5000  50000  50000    26000.0     26000.0
    2   0.3  3000  3000  6000  6000  42000  42000    22500.0     22500.0