I'm trying to create a plot similar to this one: a facet grid with overlapping strip plots and box plots. The data is stored in a pandas DataFrame. Unlike the referenced question, in addition to distributing the boxes over the X axis, I'm also drawing multiple boxes (and point strips) per X value via the hue parameter. So far so good, this works.
The problem is that the boxes and point strips do not line up with each other, as can be seen in the figure in the first column of the upper row as well as in the second and last columns of the lower row. The corresponding boxes and point strips mostly sit next to each other instead of on top of each other, and the offsets even vary.
Here is my code so far with a dummy dataset:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
################### generate dummy data set ###################
np.random.seed(20240224)
numPoints = 300 # should be divisible by 3 and 2
df = pd.DataFrame({
    "CategoryX": np.random.randint(1, 4, numPoints),
    "CategoryY": np.random.rand(numPoints),
    # the imbalance here seems to be the problem trigger
    "CategoryColor": np.random.choice([0, 1, 2, 3], size=numPoints, p=[0.33, 0.33, 0.33, 0.01]),
    "CategoryColumn": np.array(["ColA", "ColB", "ColC"] * (numPoints // 3)),
    "CategoryRow": np.array(["RowA"] * (numPoints // 2) + ["RowB"] * (numPoints // 2)),
})
################### actual plot ###################
commonParams = dict(
    x="CategoryX",
    y="CategoryY",
    hue="CategoryColor",
)
g = sns.catplot(
    data=df,
    **commonParams,
    col="CategoryColumn",
    row="CategoryRow",
    kind="strip",
    dodge=True,
)
# map by hand bc I couldn't figure out how to properly use map() or map_dataframe()
for i, s in enumerate(df['CategoryColumn'].unique()):
    for j, f in enumerate(df['CategoryRow'].unique()):
        sns.boxplot(
            data=df[(df['CategoryColumn'] == s) & (df['CategoryRow'] == f)],
            **commonParams,
            ax=g.axes[j, i],  # draw on the existing axes
            legend=False,
        )
Any help aligning this neatly on top of each other is highly appreciated!
Thank you for providing reproducible data.
Apparently, you only get a difference in order when the row or column names are numeric. Further, when some hue values are missing for one of the subplots, the boxplots (or any similar plot) get distributed without counting that hue value. When plotting the subplots one by one, Seaborn only sees the hue values present in that subplot's data.
To mitigate this, the hue column (CategoryColor) can be converted to type pd.Categorical, which forces a fixed set of hue values, even when some are missing. Note that this also changes the default palette to tab10. If needed, palette can be explicitly set to flare.
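As a minimal sketch of why the conversion helps, the snippet below uses a small made-up frame (the names toy, panel, hue and hue_cat are hypothetical, not from the question) and compares which hue levels are visible after filtering to one subplot. It only illustrates the pandas side; that Seaborn then dodges by the full category set is the behaviour described above, not something this snippet verifies.

```python
import pandas as pd

# Hypothetical toy data: hue value 3 occurs only in panel "A"
toy = pd.DataFrame({
    "panel": ["A", "A", "A", "A", "B", "B", "B"],
    "hue": [0, 1, 2, 3, 0, 1, 2],
})

# Numeric column: after filtering to panel "B", only the values
# actually present in the subset can be discovered
subset = toy[toy["panel"] == "B"]
numeric_levels = sorted(subset["hue"].unique())

# Categorical column: the full set of categories travels with the
# column, even if a category has no rows in the subset
toy["hue_cat"] = pd.Categorical(toy["hue"])
subset = toy[toy["panel"] == "B"]
categorical_levels = list(subset["hue_cat"].cat.categories)

print(numeric_levels)
print(categorical_levels)
```

With the numeric column, the subset for panel "B" exposes three levels; with the categorical column it still reports four, so every subplot can reserve the same number of dodge positions.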
(I also tested with hue_order, but that only works if you also set a palette with the same number of colors, and unfortunately it makes a mess of the figure legend. Tested with Seaborn 0.13.2 and Pandas 2.2.1.)
Here is what the example would look like:
# change the column from numeric to pd.Categorical
df["CategoryColor"] = pd.Categorical(df["CategoryColor"])
commonParams = dict(
    x="CategoryX",
    y="CategoryY",
    hue="CategoryColor",
    palette='flare',
)
g = sns.catplot(
    data=df,
    **commonParams,
    col="CategoryColumn",
    row="CategoryRow",
    kind="strip",
    dodge=True,
)
for (row, col), ax in g.axes_dict.items():
    sns.boxplot(
        data=df[(df['CategoryColumn'] == col) & (df['CategoryRow'] == row)],
        **commonParams,
        ax=ax,  # draw on the existing axes
        legend=False,
        boxprops={'alpha': 0.7},  # transparency to see stripplot
    )
plt.show()