Search code examples
pythonpandasregexrowaverage

Pandas: mean of values in each row of a df, but limited to a selected range of columns filtered with regex expression


Given my df

id  weight  Project   Exp_type   researcher events_d1 events_d2 events_d3 events_d4  events_d5   
0   50        p1        Acute      alex         0         0         0         4       2
1   52        p2        chronic    mat          0         1         1         5       1
2   75        p1        Acute      alex         1                   2                 1
3   53        p2        chronic    mat          0                             0       0

I would like to obtain the mean of values in each row but only limited to a selected interval of columns. (events_d2 to d3) so that df_output appears:

  weight  Project Exp_type   researcher events_d1 events_d2 events_d3 events_d4  events_d5  meand2_d4 
0 50      p1      Acute      alex         0         0         0         4          2          1.33
1 52      p2      chronic    mat          0         1         1         5          1          2.33
2 75      p1      Acute      alex         1                   2                    1          0.66
3 53      p2      chronic    mat          0                             0          0          0

I tried with the following

df['meand2_d4'] = df.filter(regex="events_d[2-4]").agg(np.mean, axis=1)

but obtained as output the mean of the values contained into the cells of each whole row not considering the column interval I'm interested in. I also noticed that results are averaged to the number of cells containing at least a zero, which is different for every row and depending on number of NaN/empty cells.


Solution

  • IIUC, and your blank cells are actually NaN, then you need to filter on axis=1 and fillna with 0 before taking the mean:

    df['meand2_d4'] = df.filter(regex=r'^events_d[2-4]$', axis=1).fillna(0).mean(axis=1)
    

    Output:

       id  weight Project Exp_type researcher  events_d1  events_d2  events_d3  events_d4  events_d5  meand2_d4
    0   0      50      p1    Acute       alex          0        0.0        0.0        4.0        2.0   1.333333
    1   1      52      p2  chronic        mat          0        1.0        1.0        5.0        1.0   2.333333
    2   2      75      p1    Acute       alex          1        NaN        2.0        NaN        1.0   0.666667
    3   3      53      p2  chronic        mat          0        NaN        NaN        0.0        0.0   0.000000
    

    If the values are actually empty strings, you could use replace instead of fillna:

    df['meand2_d4'] =df.filter(regex=r'^events_d[2-4]$', axis=1).replace({'':0}).mean(axis=1)
    

    Output:

       id  weight Project Exp_type researcher  events_d1 events_d2 events_d3 events_d4  events_d5  meand2_d4
    0   0      50      p1    Acute       alex          0         0         0       4.0        2.0   1.333333
    1   1      52      p2  chronic        mat          0         1         1       5.0        1.0   2.333333
    2   2      75      p1    Acute       alex          1                   2                  1.0   0.666667
    3   3      53      p2  chronic        mat          0                             0        0.0   0.000000