Given my df
id weight Project Exp_type researcher events_d1 events_d2 events_d3 events_d4 events_d5
0 50 p1 Acute alex 0 0 0 4 2
1 52 p2 chronic mat 0 1 1 5 1
2 75 p1 Acute alex 1 2 1
3 53 p2 chronic mat 0 0 0
I would like to obtain the mean of values in each row but only limited to a selected interval of columns. (events_d2 to d3) so that df_output appears:
weight Project Exp_type researcher events_d1 events_d2 events_d3 events_d4 events_d5 meand2_d4
0 50 p1 Acute alex 0 0 0 4 2 1.33
1 52 p2 chronic mat 0 1 1 5 1 2.33
2 75 p1 Acute alex 1 2 1 0.66
3 53 p2 chronic mat 0 0 0 0
I tried with the following
df['meand2_d4'] = df.filter(regex="events_d[2-4]").agg(np.mean, axis=1)
but obtained as output the mean of the values contained into the cells of each whole row not considering the column interval I'm interested in. I also noticed that results are averaged to the number of cells containing at least a zero, which is different for every row and depending on number of NaN/empty cells.
IIUC, and your blank cells are actually NaN
, then you need to filter
on axis=1
and fillna
with 0
before taking the mean:
df['meand2_d4'] = df.filter(regex=r'^events_d[2-4]$', axis=1).fillna(0).mean(axis=1)
Output:
id weight Project Exp_type researcher events_d1 events_d2 events_d3 events_d4 events_d5 meand2_d4
0 0 50 p1 Acute alex 0 0.0 0.0 4.0 2.0 1.333333
1 1 52 p2 chronic mat 0 1.0 1.0 5.0 1.0 2.333333
2 2 75 p1 Acute alex 1 NaN 2.0 NaN 1.0 0.666667
3 3 53 p2 chronic mat 0 NaN NaN 0.0 0.0 0.000000
If the values are actually empty strings, you could use replace
instead of fillna
:
df['meand2_d4'] =df.filter(regex=r'^events_d[2-4]$', axis=1).replace({'':0}).mean(axis=1)
Output:
id weight Project Exp_type researcher events_d1 events_d2 events_d3 events_d4 events_d5 meand2_d4
0 0 50 p1 Acute alex 0 0 0 4.0 2.0 1.333333
1 1 52 p2 chronic mat 0 1 1 5.0 1.0 2.333333
2 2 75 p1 Acute alex 1 2 1.0 0.666667
3 3 53 p2 chronic mat 0 0 0.0 0.000000