I’m trying to create a boxplot with individual and nested/grouped data. The dataset I use holds data for a number of households, some on 1-phase and some on 3-phase systems (#).
(#) NOTE: Where an id appears only once, the household is single-phase (1-phase); duplicated ids are 3-phase systems. Because of the duplicates, reading the CSV file via pd.read_csv(..) renames the duplicated column names (i.e. 1, 1.1 and 1.2).
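For illustration, the renaming can be reproduced on a tiny in-memory CSV. This is only a sketch: the header and values below are made up, with household 1 assumed to be 3-phase and household 2 assumed to be 1-phase.

```python
import io
import pandas as pd

# Hypothetical miniature of the CSV: id 1 occurs three times (3-phase),
# id 2 occurs once (1-phase). The values are invented.
csv_text = "1,1,1,2\n1.00,1.01,0.99,1.02\n0.98,1.00,1.01,0.97\n"

df = pd.read_csv(io.StringIO(csv_text))

# read_csv de-duplicates repeated header names by appending .1, .2, ...
columns = list(df.columns)
print(columns)  # ['1', '1.1', '1.2', '2']
```

Note that the column names come back as strings, so the suffix can later be split off with ordinary string methods.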
Using the basic plot techniques delivers:
In [4]: VoltageProfileFile= pd.read_csv(dest + '/VoltageProfiles_' + str(PV_par['value_PV']) + '%PV.csv', dtype= 'float')
...: VoltageProfileFile.boxplot(figsize=(20,5), rot= 60)
...: plt.ylim(0.9, 1.1)
...: plt.show()
Out[4]:
The result is correct, but it would be cleaner to have only one tick representing 1, 1.1 and 1.2, or 5, 5.1 and 5.2, etc.
I would like to clean this up by using a ‘categorical’ boxplot, where values from duplicates (3-phase systems) are grouped under the same id. I’m aware that seaborn lets users pass the hue parameter, sns.boxplot(x='', hue='', y='', data=''), to create categorical plots (Plotting with categorical data). However, I can’t figure out how to format my dataset to achieve this. I tried the pd.melt(..) function (cf. pandas.melt), but the resulting format changes the order in which the values appear (*).
(*) Every id is accompanied by a length up to a reference point, thus the order of appearance on the x-axis must remain.
What would be a good approach to tackle this problem? Ideally, the boxplot would group 3-phase systems under one id and display different colours for 1ph vs. 3ph systems.
Kind regards,
Rémy
For seaborn plotting, data should be in long format, with distinct indicator columns such as household, phase and value, rather than in wide format as you have it.
So consider letting pandas rename the duplicated columns to 1, 1.1, 1.2 and then running pd.melt to reshape into long format. The household and phase columns are then built with assign, where you split the column name on . and take the first and second parts respectively (a missing second part becomes phase 1):
VoltageProfileFile_long = (
    pd.melt(VoltageProfileFile, var_name = 'phase')
      # household is derived first, while 'phase' still holds the original column name
      .assign(household = lambda x: x['phase'].str.split("\\.").str[0].astype(int),
              phase = lambda x: pd.to_numeric(x['phase'].str.split("\\.").str[1]).fillna(0).astype(int).add(1))
      .reindex(['household', 'phase', 'value'], axis='columns')
)
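On the ordering concern from the question: pd.melt stacks the columns left to right, so the order in which household ids first appear in the long frame matches the CSV. As far as I know, seaborn sorts numeric categories by default, so an explicit order list may be needed when the ids are not ascending along the line. A minimal sketch, assuming made-up ids 7, 3, 5 in that feeder order:

```python
import io
import pandas as pd

# Hypothetical CSV whose household ids (7, 3, 3, 3, 5) are not in ascending order.
csv_text = "7,3,3,3,5\n1.00,1.01,0.99,1.02,0.98\n0.98,1.00,1.01,0.97,1.03\n"
wide = pd.read_csv(io.StringIO(csv_text))

long_df = pd.melt(wide, var_name="phase")
long_df["household"] = long_df["phase"].str.split("\\.").str[0].astype(int)

# unique() keeps the order of first appearance, i.e. the CSV column order.
household_order = long_df["household"].unique().tolist()
print(household_order)  # [7, 3, 5]
```

The resulting list can then be passed as order=household_order to sns.boxplot so the x-axis follows the original column order rather than the sorted ids.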
Below is a demo with random data
Data (dumped to csv then read back in for pandas renaming process)
import numpy as np
import pandas as pd

np.random.seed(111620)
VoltageProfileFile = pd.DataFrame([np.random.uniform(0.95, 1.05, 13) for i in range(50)],
                                  columns = [1, 1, 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9])
VoltageProfileFile.to_csv('data.csv', index=False)
VoltageProfileFile = pd.read_csv('data.csv')
VoltageProfileFile.head(10)
# 1 1.1 1.2 2 3 ... 5.2 6 7 8 9
# 0 1.012732 1.042768 0.975577 0.965508 1.048544 ... 1.010898 1.008921 1.006769 1.019615 1.036926
# 1 1.013457 1.048378 1.025201 0.982988 0.995133 ... 1.024578 1.024362 0.985693 1.041609 0.995037
# 2 1.024739 1.008590 0.960278 0.956811 1.001739 ... 0.969436 0.953134 0.966851 1.031544 1.036572
# 3 1.037998 0.993246 0.970146 0.989196 0.959527 ... 1.015577 1.027020 1.038941 0.971666 1.040658
# 4 0.995877 0.955734 0.952497 1.040942 0.985759 ... 1.021805 1.044108 0.980657 1.034179 0.980722
# 5 0.994755 0.951557 0.986580 1.021583 0.959249 ... 1.046740 0.998429 1.027406 1.007391 0.989477
# 6 1.023979 1.043418 1.020745 1.006081 1.030413 ... 0.964579 1.035479 0.982969 0.953484 1.005889
# 7 1.018904 1.045440 1.003997 1.018295 0.954814 ... 0.955295 0.960958 0.999492 1.010163 0.985847
# 8 0.960913 0.982671 1.016659 1.030384 1.043750 ... 1.042720 0.972287 1.039235 0.969571 0.999418
# 9 1.017085 0.998049 0.989664 0.953420 1.018018 ... 0.953041 0.955883 1.004630 0.996443 1.017762
Plot (after the same processing to generate VoltageProfileFile_long)
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
fig, ax = plt.subplots(figsize=(8,4))
sns.boxplot(x='household', y='value', hue='phase', data=VoltageProfileFile_long, ax=ax)
plt.title('Boxplot of Values by Household and Phases')
plt.tight_layout()
plt.show()
plt.clf()
plt.close()
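If the colours should distinguish 1ph from 3ph households rather than individual phases, one option is to add a label column derived from the number of distinct phases per household. This is a sketch on a made-up long-format frame; the column name system and the labels "1ph"/"3ph" are my own choices, not something from the question.

```python
import pandas as pd

# Hypothetical long-format frame in the same shape as VoltageProfileFile_long.
long_df = pd.DataFrame({
    "household": [1, 1, 1, 2, 5, 5, 5],
    "phase":     [1, 2, 3, 1, 1, 2, 3],
    "value":     [1.00, 1.01, 0.99, 1.02, 0.98, 1.00, 1.03],
})

# Count distinct phases per household and label every row of that household.
n_phases = long_df.groupby("household")["phase"].transform("nunique")
long_df["system"] = n_phases.map({1: "1ph", 3: "3ph"})

print(long_df[["household", "system"]].drop_duplicates().to_string(index=False))
```

With this column, sns.boxplot(x='household', y='value', hue='system', dodge=False, data=long_df) should draw a single box per household, pooling the three phases and colouring by system type. Whether pooling the phases is acceptable depends on what you want to read from the plot; keeping hue='phase' as above shows them separately.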