I want to group by two columns and, if a group contains more than one row, return only the first row of the group with the value of c replaced by the group mean. If a group contains only one row, I want to return that row unchanged. My code looks like this:
def condense(df):
    if df.shape[0] > 1:
        mean = df["c"].mean()
        # Copy the first row so the assignment below does not write to a slice of df
        record = df.iloc[[0]].copy()
        record["c"] = mean
        return record
    else:
        return df

final = df.groupby(["a", "b"]).apply(condense).drop(['a', 'b'], axis=1).reset_index()
And the df looks something like this:
a    b    c  d
"f"  "e"  2  True
"f"  "e"  3  False
"c"  "a"  1  True
The data frame is quite large: it has 73800 groups, and the whole groupby + apply computation takes about a minute. This is far too long. Is there a way to make it run faster?
The mean of a single value is that value itself, so groups with one row need no special case. You can therefore simplify the solution with GroupBy.agg, using mean for column c and first for all other columns:
d = dict.fromkeys(df.columns.difference(['a','b']), 'first')
d['c'] = 'mean'
print(d)
{'c': 'mean', 'd': 'first'}
df = df.groupby(["a", "b"], as_index=False).agg(d)
print(df)
   a  b    c     d
0  c  a  1.0  True
1  f  e  2.5  True
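For convenience, the same idea can be put together as a self-contained sketch that runs on the sample data from the question. The helper name condense_groups and its parameters are hypothetical, chosen only for illustration; the aggregation itself is the GroupBy.agg approach described above.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "a": ["f", "f", "c"],
    "b": ["e", "e", "a"],
    "c": [2, 3, 1],
    "d": [True, False, True],
})

def condense_groups(frame, keys, mean_col):
    """Hypothetical helper: one row per group, mean of mean_col, first value elsewhere."""
    # Every non-key column defaults to 'first' ...
    agg_map = dict.fromkeys(frame.columns.difference(keys), "first")
    # ... except the column that should be averaged.
    agg_map[mean_col] = "mean"
    return frame.groupby(keys, as_index=False).agg(agg_map)

result = condense_groups(df, ["a", "b"], "c")
print(result)
```

One difference from the original apply version that may matter: 'first' takes the first non-null entry of each column within a group, so the result can differ from iloc[[0]] when the first row of a group has missing values.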