I have this data frame:
df = pd.DataFrame({ "cluster" : ["A", "A", "A", "A", "A", "C", "C", "C", "C"],
"col1": ["0", "0", "1", "1", "0", "1", "0", "1", "0"],
"col2": ["1", "1", "1", "1", "0", "1", "1", "1", "0"] })
and I can't figure out the best way to have a data frame result with this format
col | cluster | avg |
---|---|---|
col1 | A | 0.4 |
col1 | C | 0.5 |
col2 | A | 0.8 |
col2 | C | 0.75 |
I tried using transpose, then melt the cluster, but didn't work. Any help would be highly appreciated!
My desired result is the table - where all columns (except cluster) are rows, and cluster is melt and the average is calculated for the values of the column for each cluster
Looks like a combination of melt
and groupby
+ mean
would do the trick. Note that the "col" values are strings in your input, so we also have to convert them to integers first before calculating the mean:
out = (df.melt(['cluster'], var_name='col')
.assign(value=lambda x: x['value'].astype(int))
.groupby(['col', 'cluster'], as_index=False).mean())
Output:
col cluster value
0 col1 A 0.40
1 col1 C 0.50
2 col2 A 0.80
3 col2 C 0.75