Here is my dilemma:
I have a dictionary of dataframes like this:
dict_df[key]
m1 m2 m3 m4 m5 m6
10410 5 10 21 33 11
15387 3 10 33 45 13
19026 4 16 27 40 11
26083 5 21 16 29 9
27806 4 17 23 31 7
43820 2 12 27 40 18
49199 7 22 30 38 11
50094 4 9 13 18 4
Each key returns a DataFrame with the same column names.
For each key, I need to store the mean and standard deviation of a set of features (for example m2, m3, m4).
In the end, I want to obtain something like the dataframe below (the numbers are totally random):
key m2_mean m2_std m3_mean m3_std m4_mean m4_std
key1 12 55 793 438 44 95
key2 14 442 21 43 14 442
key3 44 1 66 11 42 42
key4 42 42 2 23 98 70
The dataset is not that big, so even slow code would be fine.
Thanks for the help and have a good one!
First, let's define some sample data:
>>> import pandas as pd
>>> df1 = pd.DataFrame({
"col1": [1, 2, 3],
"col2": [4, 5, 6],
})
>>> df2 = pd.DataFrame({
"col1": [7, 8, 9],
"col2": [10, 11, 12],
})
>>> dict_df = {
"df1": df1,
"df2": df2,
}
Now, you can use .agg() to calculate the mean and std of your dataframe (I've used max instead of std for simplicity), .stack() to reduce the dataframe to a single series, and .to_dict() to represent this result as a dictionary. Notice that we'll only use one of the dataframes (df1) to show this result:
>>> df1.agg(["mean", "max"]).stack().to_dict()
{('mean', 'col1'): 2.0, ('mean', 'col2'): 5.0, ('max', 'col1'): 3.0, ('max', 'col2'): 6.0}
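To make the chain easier to follow, here is a minimal sketch that splits it into its three steps. The names stats, stacked and as_dict are only illustrative and do not appear in the snippet above.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# Step 1: one row per statistic, one column per original column.
stats = df1.agg(["mean", "max"])

# Step 2: stack() moves the column labels into the row index, giving a
# Series whose index holds (statistic, column) pairs.
stacked = stats.stack()

# Step 3: a plain dict whose keys are those (statistic, column) tuples.
as_dict = stacked.to_dict()
```

The tuple keys are what later become the two-level column header when the dicts are combined into one dataframe.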
With this dict representation, we can use pd.DataFrame.from_dict to build a single dataframe with the metrics for each value in dict_df:
>>> df = pd.DataFrame.from_dict({
df_name: df[["col1", "col2"]].agg(["mean", "max"]).stack().to_dict()
for df_name, df in dict_df.items()
}, orient="index")
>>> df
mean max
col1 col2 col1 col2
df1 2.0 5.0 3.0 6.0
df2 8.0 11.0 9.0 12.0
The only important difference from your expected output is in the column names, but we can fix that manually:
>>> df.columns = ["_".join(column) for column in df.columns]
>>> df
mean_col1 mean_col2 max_col1 max_col2
df1 2.0 5.0 3.0 6.0
df2 8.0 11.0 9.0 12.0
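Note that joining the tuples as they are puts the statistic first (mean_col1), while the question asks for the column first (m2_mean). The sketch below rebuilds the two-level frame by hand, which is an assumption made only to keep the example self-contained, and shows both naming orders side by side.

```python
import pandas as pd

# Hand-built stand-in for the two-level frame shown above.
columns = pd.MultiIndex.from_tuples(
    [("mean", "col1"), ("mean", "col2"), ("max", "col1"), ("max", "col2")]
)
df = pd.DataFrame(
    [[2.0, 5.0, 3.0, 6.0], [8.0, 11.0, 9.0, 12.0]],
    index=["df1", "df2"],
    columns=columns,
)

# Iterating over MultiIndex columns yields one (statistic, column) tuple each.
stat_first = ["_".join(column) for column in df.columns]
column_first = [f"{name}_{stat}" for stat, name in df.columns]

df.columns = column_first
```

Only the labels change here; the columns stay in their original positions, so the means still come before the maxima.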
Code that would do the trick for you:
>>> target_columns = ["m2", "m3", "m4"]
>>> df = pd.DataFrame.from_dict({
key: df[target_columns].agg(["mean", "std"]).stack().to_dict()
for key, df in dict_df.items()
}, orient="index")
>>> df.columns = ["_".join(column) for column in df.columns]
>>> df.index.name = "key"
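If the exact layout from the question matters (m2_mean, m2_std, m3_mean, ...), one possible variation is to transpose the aggregated frame before stacking. The sketch below is a self-contained version under that assumption. The sample values and the names frame and result are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data shaped like the frames in the question.
dict_df = {
    "key1": pd.DataFrame({"m2": [5, 3, 4], "m3": [10, 10, 16], "m4": [21, 33, 27]}),
    "key2": pd.DataFrame({"m2": [5, 4, 2], "m3": [21, 17, 12], "m4": [16, 23, 27]}),
}
target_columns = ["m2", "m3", "m4"]

result = pd.DataFrame.from_dict(
    {
        # .T puts the original columns on the rows, so stack() yields
        # (column, statistic) keys ordered m2/mean, m2/std, m3/mean, ...
        key: frame[target_columns].agg(["mean", "std"]).T.stack().to_dict()
        for key, frame in dict_df.items()
    },
    orient="index",
)

# Flatten the (column, statistic) tuples into names like "m2_mean".
result.columns = [f"{column}_{stat}" for column, stat in result.columns]
result.index.name = "key"
```

Keep in mind that pandas computes std as the sample standard deviation (ddof=1) by default.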