Reshape dictionary to make violin plot

I have some data stored in a dictionary of DataFrames. The real data is much bigger, with an index running up to 3000 and more columns.

In the end I want to make a violin plot of two of the columns in the DataFrames, but across multiple dictionary entries. The dictionary has a tuple as its key, and I want to gather all entries whose key has the same first number.


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data_dict = {
    (5, 1): pd.DataFrame({"Data_1": [0.235954, 0.739301, 0.443639],
                          "Data_2": [0.069884, 0.236283, 0.458250],
                          "Data_3": [0.170902, 0.496346, 0.399278],
                          "Data_4": [0.888658, 0.591893, 0.381895]}),
    (5, 2): pd.DataFrame({"Data_1": [0.806812, 0.224321, 0.504660],
                          "Data_2": [0.070355, 0.943047, 0.579285],
                          "Data_3": [0.526866, 0.251339, 0.600688],
                          "Data_4": [0.283107, 0.409486, 0.307315]}),
    (7, 3): pd.DataFrame({"Data_1": [0.415159, 0.834547, 0.170972],
                          "Data_2": [0.125926, 0.401789, 0.759203],
                          "Data_3": [0.398494, 0.587857, 0.130558],
                          "Data_4": [0.202393, 0.395692, 0.035602]}),
    (7, 4): pd.DataFrame({"Data_1": [0.923432, 0.622174, 0.185039],
                          "Data_2": [0.759154, 0.126699, 0.783596],
                          "Data_3": [0.075643, 0.287721, 0.939428],
                          "Data_4": [0.983739, 0.738550, 0.108639]})
}

My idea was to re-arrange the data into a different dictionary and then plot the violin plot from that. Say that 'Data_1' and 'Data_4' are of interest. I then loop over the keys in data_dict as below.

new_dict = {}
for col in ['Data_1', 'Data_4']:
    df = pd.DataFrame()
    for i in [5, 7]:
        # Collect this column from every entry whose key starts with i
        temp = []
        for key, value in data_dict.items():
            if key[0] == i:
                temp.extend(value[col])
        df[i] = temp
    new_dict[col] = df

This produces the following dict.

new_dict = 
{'Data_1':           5         7
 0  0.235954  0.415159
 1  0.739301  0.834547
 2  0.443639  0.170972
 3  0.806812  0.923432
 4  0.224321  0.622174
 5  0.504660  0.185039,
 'Data_4':           5         7
 0  0.888658  0.202393
 1  0.591893  0.395692
 2  0.381895  0.035602
 3  0.283107  0.983739
 4  0.409486  0.738550
 5  0.307315  0.108639}

I then loop over it to make the violin plots for Data_1 and Data_4.

for key, value in new_dict.items():
    fig, ax = plt.subplots()
    ax.violinplot(value, showmeans=True)
    ax.set(title=key, xlabel='Section', ylabel='Value')
    ax.set_xticks(np.arange(1, 3), labels=['5', '7'])
While this gives the desired result, re-arranging the dictionary is very cumbersome. Could it be done in a simpler or faster way? Since I want the same columns from every dictionary entry, I feel it should be possible.

Solution

You could minimize the reshaping by combining concat and melt, and by using a higher-level plotting library such as seaborn:

import seaborn as sns

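sns.catplot(
    # concat stacks all frames; the first tuple element becomes 'section'
    data=(pd.concat(data_dict, names=['section', None])
            [['Data_1', 'Data_4']]
            # melt to long format: one row per (section, dataset, value)
            .melt(ignore_index=False, var_name='dataset')
            .reset_index()),
    row='dataset',
    x='section', y='value',
    kind='violin',
)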

Output: a grid with one violin panel per dataset (Data_1 and Data_4), with sections 5 and 7 on the x-axis.
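
To see what seaborn actually receives, it can help to build the long-format frame on its own and inspect it. The following is a minimal sketch, assuming a trimmed copy of the question's data with only the two columns of interest; the names `stacked` and `long_df` are illustrative and not part of the original answer.

```python
import pandas as pd

# Trimmed copy of the question's data: only the two columns of interest.
data_dict = {
    (5, 1): pd.DataFrame({"Data_1": [0.235954, 0.739301, 0.443639],
                          "Data_4": [0.888658, 0.591893, 0.381895]}),
    (5, 2): pd.DataFrame({"Data_1": [0.806812, 0.224321, 0.504660],
                          "Data_4": [0.283107, 0.409486, 0.307315]}),
    (7, 3): pd.DataFrame({"Data_1": [0.415159, 0.834547, 0.170972],
                          "Data_4": [0.202393, 0.395692, 0.035602]}),
    (7, 4): pd.DataFrame({"Data_1": [0.923432, 0.622174, 0.185039],
                          "Data_4": [0.983739, 0.738550, 0.108639]}),
}

# Stack the frames; the first tuple element becomes the 'section' index level.
stacked = pd.concat(data_dict, names=["section", None])[["Data_1", "Data_4"]]

# Long format: one row per observation, labelled by section and dataset.
# (long_df is an illustrative name, not taken from the answer.)
long_df = (
    stacked
    .melt(ignore_index=False, var_name="dataset")
    .reset_index()
)

# Count the observations available for each violin.
print(long_df.groupby(["dataset", "section"])["value"].size())
```

Each (dataset, section) pair collects the rows of every source frame whose key starts with that section number, which is the grouping the question builds by hand.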

Another approach to reshape:

tmp = (
    pd.concat(data_dict, names=['section', None])
    [['Data_1', 'Data_4']]
    .pipe(lambda x: x.set_axis(pd.MultiIndex.from_arrays(
        [x.index.get_level_values('section'),
         x.groupby('section').cumcount()])))
    .T.stack()
)

# then access the datasets
tmp.loc['Data_1']
# section         5         7
# 0        0.235954  0.415159
# 1        0.739301  0.834547
# 2        0.443639  0.170972
# 3        0.806812  0.923432
# 4        0.224321  0.622174
# 5        0.504660  0.185039
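
If you would rather stay with plain matplotlib, as in the question, a groupby on the concatenated frame may be enough. This is a sketch of a different technique from the two above, not part of the original answer: the helper name `violin_by_section` is hypothetical, and the data is a trimmed copy of the question's dictionary. Since each section is passed to violinplot as its own array, the sections need not have the same number of rows.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Trimmed copy of the question's data: only the two columns of interest.
data_dict = {
    (5, 1): pd.DataFrame({"Data_1": [0.235954, 0.739301, 0.443639],
                          "Data_4": [0.888658, 0.591893, 0.381895]}),
    (5, 2): pd.DataFrame({"Data_1": [0.806812, 0.224321, 0.504660],
                          "Data_4": [0.283107, 0.409486, 0.307315]}),
    (7, 3): pd.DataFrame({"Data_1": [0.415159, 0.834547, 0.170972],
                          "Data_4": [0.202393, 0.395692, 0.035602]}),
    (7, 4): pd.DataFrame({"Data_1": [0.923432, 0.622174, 0.185039],
                          "Data_4": [0.983739, 0.738550, 0.108639]}),
}


def violin_by_section(frames, column):
    """Hypothetical helper: draw one violin per first key element."""
    # Stack all frames; the first tuple element becomes the 'section' level.
    stacked = pd.concat(frames, names=["section", None])[column]
    # One array of values per section, in sorted section order.
    grouped = {section: values.to_numpy()
               for section, values in stacked.groupby(level="section")}
    fig, ax = plt.subplots()
    ax.violinplot(list(grouped.values()), showmeans=True)
    ax.set(title=column, xlabel="Section", ylabel="Value")
    ax.set_xticks(list(range(1, len(grouped) + 1)),
                  labels=[str(section) for section in grouped])
    return fig, ax, grouped


results = {col: violin_by_section(data_dict, col)
           for col in ["Data_1", "Data_4"]}
```

Call plt.show() afterwards to display the figures interactively.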