I want to aggregate a DataFrame
based on a condition. I can do it in some scenarios but not in every one; here is an example:
import pandas as pd
import numpy as np
th=0.5
tuples = list(zip(*[
["foo", "foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"],
["A", "A", "A", "B", "B", "B", "A", "A", "A", "B", "B", "B"],
["one", "two", "three", "one", "two", "three", "one", "two", "three", "one", "two", "three"],
]))
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second", "third"])
# create test DataFrame and a simpler copy
df1 = pd.DataFrame(np.array([0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341,
0.27516962, 0.26154782, 0.68857624, 0.75173469, 0.34701632,
0.10108433, 0.32280973]), index=index, columns=["data"])
df2 = df1.reset_index()[["first", "second", "data"]].groupby(["first", "second"]).sum().copy()
# if the amount is greater or equal to the threshold copy the value
df1.loc[df1['data'] >= th, 'test'] = df1.loc[df1['data'] >= th, 'data']
df2.loc[df2['data'] >= th, 'test'] = df2.loc[df2['data'] >= th, 'data']
# if the amount is smaller than the threshold aggregate all such values for the group
df1.loc[df1['data'] < th, 'test'] = df1.loc[df1['data'] < th, 'data'].groupby(["first", "second"]).sum()
df2.loc[df2['data'] < th, 'test'] = df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()
It gives the output:
data test
first second third
foo A one 0.043250 0.278887
two 0.177301 0.278887
three 0.058336 0.278887
B one 0.159956 0.923709
two 0.488583 0.923709
three 0.275170 0.923709
bar A one 0.261548 0.261548
two 0.688576 0.688576
three 0.751735 0.751735
B one 0.347016 0.770910
two 0.101084 0.770910
three 0.322810 0.770910
data test
first second
bar A 1.701859 1.701859
B 0.770910 0.770910
foo A 0.278887 NaN
B 0.923709 0.923709
The first DataFrame
is as intended, but the second one has np.nan
where the aggregation should be. Why does the first work while the second does not? How can I achieve this?
Bonus:
I would like to have a DataFrame
with n aggregation levels and apply this procedure to each of them from the innermost to the outermost, keeping a value if it is over the threshold and aggregating all values smaller than the threshold. Is there a better solution than a for
loop?
The problem seems to be:
df2.loc[df2['data'] < th, 'data'].groupby(["first"]).sum()
First, don't pass a one-element list to groupby
; it raises a FutureWarning
:
FutureWarning: In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.
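A minimal sketch of the string-grouper form on toy data (the Series and values here are illustrative, not the question's frame):

```python
import pandas as pd

s = pd.Series(
    [0.1, 0.2, 0.3],
    index=pd.Index(["foo", "foo", "bar"], name="first"),
)
# pass the level name as a plain string rather than a one-element list
print(s.groupby("first").sum())
```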
Next, if you group by one level, you have to broadcast the aggregated values back to all filtered rows so the result keeps the same dimensions (the multilevel index):
>>> df2.loc[df2['data'] < th, 'data'].groupby('first').sum()
first
foo 0.278887
Name: data, dtype: float64
>>> df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
first second
foo A 0.278887
Name: data, dtype: float64
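The NaN in the question comes from exactly this mismatch: a .loc assignment aligns the right-hand side on the index, and a result indexed only by first cannot align with the (first, second) MultiIndex. A minimal reproduction on toy data:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [("foo", "A"), ("foo", "B")], names=["first", "second"]
)
df = pd.DataFrame({"data": [0.2, 0.9]}, index=idx)
small = df["data"] < 0.5
# the aggregated result is indexed by "first" only ...
rhs = df.loc[small, "data"].groupby("first").sum()
# ... so it cannot align with the two-level index and NaN is assigned
df.loc[small, "test"] = rhs
print(df)
```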
Now the levels are the same so you can set the values:
df2.loc[df2['data'] < th, 'test'] = \
df2.loc[df2['data'] < th, 'data'].groupby('first').transform('sum')
print(df2)
# Output
data test
first second
bar A 1.701859 1.701859
B 0.770910 0.770910
foo A 0.278887 0.278887 # <- HERE
B 0.923709 0.923709
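For the bonus: I don't see a way to avoid looping over the depths, but the loop can stay short, with one transform-based pass per level and a groupby-sum to collapse the innermost level between passes. A sketch under the assumption that the procedure at each depth is the same as above (threshold_pass and threshold_all_levels are my own names, and the data is the question's df1):

```python
import pandas as pd

def threshold_pass(df, th):
    """Keep 'data' where it is >= th; replace sub-threshold values with
    the sum of the sub-threshold values in their group, where a group
    is all index levels except the innermost."""
    small = df["data"] < th
    outer = list(df.index.names[:-1])
    out = df.copy()
    out["test"] = df["data"]
    out.loc[small, "test"] = (
        df.loc[small, "data"].groupby(level=outer).transform("sum")
    )
    return out

def threshold_all_levels(df, th):
    """Apply threshold_pass at every depth, from the innermost level
    to the outermost, collapsing the innermost level between passes."""
    results = []
    current = df
    while current.index.nlevels >= 2:
        results.append(threshold_pass(current, th))
        # collapse the innermost level for the next, coarser pass
        current = current.groupby(
            level=list(current.index.names[:-1])
        )[["data"]].sum()
    return results

# the question's frame, rebuilt with from_product (same row order)
index = pd.MultiIndex.from_product(
    [["foo", "bar"], ["A", "B"], ["one", "two", "three"]],
    names=["first", "second", "third"],
)
df1 = pd.DataFrame(
    {"data": [0.04325033, 0.17730071, 0.05833607, 0.15995565, 0.48858341,
              0.27516962, 0.26154782, 0.68857624, 0.75173469, 0.34701632,
              0.10108433, 0.32280973]},
    index=index,
)
results = threshold_all_levels(df1, 0.5)
print(results[1])
```

With the question's data this produces two frames: results[0] matches the df1 output above, and results[1] matches the corrected df2.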