I have a dataframe, output from an upstream step, that looks like the one below; it always has a variable number of numerical columns, denoted here as t1-t4.
hash name group trial t1 t2 t3 t4
1AomKHNL56l EP1 G01 clump 1 4 5 9
2oKA7J1B3GL EP2 G02 green 2 10 0 24
zsfAu5Q6I60 EP1 G01 clump 4 3 1 3
v68fGHY8zx1 EP2 G02 green 1 5 9 22
What I want to do is group the dataframe on the "group" category and sum all the numerical columns which I know I can do using
df.groupby(["group"]).sum()
This drops all my non-numerical columns, though, and I want something somewhat dynamic. For the "hash" column I want something that looks like the output of this:
df.groupby(["group"]).agg({"hash": list})
And for the other non-numerical columns I just want to maintain them as is because they will all be the same across the unique group ids. So the end result would be something like this:
hash name group trial t1 t2 t3 t4
[1AomKHNL56l,zsfAu5Q6I60] EP1 G01 clump 5 7 6 12
[2oKA7J1B3GL,v68fGHY8zx1] EP2 G02 green 3 15 9 46
Right now the only ways I can think of involve either running the agg and sum functions separately for each column type and merging the dataframes at the end, OR typing out every column id in the agg function. Does anyone know of a way to do this natively in groupby, without the extra dataframe-merging steps and without listing out every column id and its aggregation? Perhaps some kind of lambda statement?
For reference, the column ids for hash, name, group, and trial are always the same, but the numerical columns have different names depending on which sample groups are being looked at.
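For anyone who wants to reproduce this, the example frame above can be built directly (values copied from the tables in the question):

```python
import pandas as pd

# minimal reproduction of the input frame from the question
df = pd.DataFrame({
    'hash': ['1AomKHNL56l', '2oKA7J1B3GL', 'zsfAu5Q6I60', 'v68fGHY8zx1'],
    'name': ['EP1', 'EP2', 'EP1', 'EP2'],
    'group': ['G01', 'G02', 'G01', 'G02'],
    'trial': ['clump', 'green', 'clump', 'green'],
    't1': [1, 2, 4, 1],
    't2': [4, 10, 3, 5],
    't3': [5, 0, 1, 9],
    't4': [9, 24, 3, 22],
})
```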
You can use a dictionary of aggregation functions:
# default is sum
d = {c: 'sum' for c in df.columns}
# change a few other columns
d.update({'hash': list, 'name': 'first', 'group': 'first', 'trial': 'first'})
# aggregate
df.groupby('group', as_index=False).agg(d)
Alternative option to set up the dictionary:
d = {'first': ['name', 'group', 'trial'],
'sum': ['t1', 't2', 't3', 't4'],
list: ['hash']}
d = {k: v for v, l in d.items() for k in l}
df.groupby('group', as_index=False).agg(d)[df.columns]
NB. you can also combine both!
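As a sketch of combining the two ideas (the dynamic default from the first option plus the inverted-mapping trick from the second; here the grouping key is left out of the dict and the sample data is rebuilt from the question):

```python
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'hash': ['1AomKHNL56l', '2oKA7J1B3GL', 'zsfAu5Q6I60', 'v68fGHY8zx1'],
    'name': ['EP1', 'EP2', 'EP1', 'EP2'],
    'group': ['G01', 'G02', 'G01', 'G02'],
    'trial': ['clump', 'green', 'clump', 'green'],
    't1': [1, 2, 4, 1],
    't2': [4, 10, 3, 5],
    't3': [5, 0, 1, 9],
    't4': [9, 24, 3, 22],
})

# default is sum, built dynamically over whatever columns are present,
# so it covers any number of numerical columns
d = {c: 'sum' for c in df.columns if c != 'group'}

# override the fixed non-numerical columns via the inverted mapping
overrides = {'first': ['name', 'trial'], list: ['hash']}
d.update({col: func for func, cols in overrides.items() for col in cols})

# aggregate, then restore the original column order
out = df.groupby('group', as_index=False).agg(d)[df.columns]
```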
output:
hash name group trial t1 t2 t3 t4
0 [1AomKHNL56l, zsfAu5Q6I60] EP1 G01 clump 5 7 6 12
1 [2oKA7J1B3GL, v68fGHY8zx1] EP2 G02 green 3 15 9 46