I have some data that is saved in a dictionary of dataframes. The real data is much bigger with index up to 3000 and more columns.
In the end I want to make a violin plot of two of the columns in the dataframes, but across multiple dictionary entries. The dictionary has a tuple as a key, and I want to gather all entries whose first number is the same.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
data_dict = {
    (5, 1): pd.DataFrame({"Data_1": [0.235954, 0.739301, 0.443639],
                          "Data_2": [0.069884, 0.236283, 0.458250],
                          "Data_3": [0.170902, 0.496346, 0.399278],
                          "Data_4": [0.888658, 0.591893, 0.381895]}),
    (5, 2): pd.DataFrame({"Data_1": [0.806812, 0.224321, 0.504660],
                          "Data_2": [0.070355, 0.943047, 0.579285],
                          "Data_3": [0.526866, 0.251339, 0.600688],
                          "Data_4": [0.283107, 0.409486, 0.307315]}),
    (7, 3): pd.DataFrame({"Data_1": [0.415159, 0.834547, 0.170972],
                          "Data_2": [0.125926, 0.401789, 0.759203],
                          "Data_3": [0.398494, 0.587857, 0.130558],
                          "Data_4": [0.202393, 0.395692, 0.035602]}),
    (7, 4): pd.DataFrame({"Data_1": [0.923432, 0.622174, 0.185039],
                          "Data_2": [0.759154, 0.126699, 0.783596],
                          "Data_3": [0.075643, 0.287721, 0.939428],
                          "Data_4": [0.983739, 0.738550, 0.108639]})
}
My idea was to rearrange it into a different dictionary and then draw the violin plots. Say that 'Data_1' and 'Data_4' are of interest. I then loop over the keys in data_dict as below.
new_dict = {}
for col in ['Data_1','Data_4']:
    df = pd.DataFrame()
    for i in [5,7]:
        # collect this column from every entry whose first key element is i
        temp = []
        for key, value in data_dict.items():
            if key[0]==i:
                temp.extend(value[col])
        df[i] = temp
    new_dict[col] = df
This then makes the following dict.
new_dict =
{'Data_1': 5 7
0 0.235954 0.415159
1 0.739301 0.834547
2 0.443639 0.170972
3 0.806812 0.923432
4 0.224321 0.622174
5 0.504660 0.185039,
'Data_4': 5 7
0 0.888658 0.202393
1 0.591893 0.395692
2 0.381895 0.035602
3 0.283107 0.983739
4 0.409486 0.738550
5 0.307315 0.108639}
Which I then loop over to make the violin plots for Data_1 and Data_4.
for key, value in new_dict.items():
    fig, ax = plt.subplots()
    ax.violinplot(value, showmeans=True)
    ax.set(title=key, xlabel='Section', ylabel='Value')
    ax.set_xticks(np.arange(1,3), labels=['5','7'])
While I get the desired result, it's very cumbersome to rearrange the dictionary. Could this be done in a faster way? Since I want the same columns from each dictionary entry, I feel that it should be possible.
You could minimize the reshaping by using concat + melt and a higher-level plotting library like seaborn:
import seaborn as sns
sns.catplot(data=pd.concat(data_dict, names=['section', None])
[['Data_1', 'Data_4']]
.melt(ignore_index=False, var_name='dataset')
.reset_index(),
row='dataset',
x='section', y='value',
kind='violin',
)
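Output: [figure: one violin plot per dataset (Data_1, Data_4), with section on the x-axis and value on the y-axis]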
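To see what the plotting call actually receives, it can help to build the long-format frame on its own and inspect it. This is only a minimal sketch: the small `data_dict` below uses made-up values rather than the real data, and `long_df` is a name introduced here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for data_dict (values are made up for illustration)
data_dict = {
    (5, 1): pd.DataFrame({"Data_1": [0.1, 0.2], "Data_4": [0.3, 0.4]}),
    (5, 2): pd.DataFrame({"Data_1": [0.5, 0.6], "Data_4": [0.7, 0.8]}),
    (7, 3): pd.DataFrame({"Data_1": [0.9, 1.0], "Data_4": [1.1, 1.2]}),
}

long_df = (
    # the tuple keys become index levels; only the first one gets a name
    pd.concat(data_dict, names=["section", None])[["Data_1", "Data_4"]]
    # one row per (original row, column) pair, keeping the index
    .melt(ignore_index=False, var_name="dataset")
    # turn the index levels into ordinary columns
    .reset_index()
)

# the columns used for plotting: x='section', row='dataset', y='value'
print(long_df[["section", "dataset", "value"]].head())
```

Each row of `long_df` is a single measurement labelled with its section and dataset, which is the shape seaborn's categorical plots expect.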
Another approach to reshape:
tmp = (pd
.concat(data_dict, names=['section', None])
[['Data_1', 'Data_4']]
.pipe(lambda x: x.set_axis(pd.MultiIndex.from_arrays([x.index.get_level_values('section'),
x.groupby('section').cumcount()])))
.T.stack()
)
# then access the datasets
tmp.loc['Data_1']
# section 5 7
# 0 0.235954 0.415159
# 1 0.739301 0.834547
# 2 0.443639 0.170972
# 3 0.806812 0.923432
# 4 0.224321 0.622174
# 5 0.504660 0.185039
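If the goal is to keep the original matplotlib loop, `tmp` can be turned back into a dictionary with the same layout as `new_dict` from the question. The following is a sketch under that assumption, again with made-up values standing in for the real data:

```python
import pandas as pd

# Hypothetical stand-in for data_dict (values are made up for illustration)
data_dict = {
    (5, 1): pd.DataFrame({"Data_1": [1.0, 2.0], "Data_4": [10.0, 20.0]}),
    (5, 2): pd.DataFrame({"Data_1": [3.0, 4.0], "Data_4": [30.0, 40.0]}),
    (7, 3): pd.DataFrame({"Data_1": [5.0, 6.0], "Data_4": [50.0, 60.0]}),
    (7, 4): pd.DataFrame({"Data_1": [7.0, 8.0], "Data_4": [70.0, 80.0]}),
}
cols = ["Data_1", "Data_4"]

tmp = (
    pd.concat(data_dict, names=["section", None])[cols]
    # replace the index with (section, running row number within that section)
    .pipe(lambda x: x.set_axis(pd.MultiIndex.from_arrays(
        [x.index.get_level_values("section"),
         x.groupby("section").cumcount()])))
    # move the row number into the index, leaving sections as columns
    .T.stack()
)

# one DataFrame per dataset, with one column per section
new_dict = {name: tmp.loc[name] for name in cols}
print(new_dict["Data_1"])
```

The resulting `new_dict` can then be passed to the existing `ax.violinplot` loop unchanged.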