I want to add boxplots to a line graph in base R.
I have a rather big data set, and I want to create many different plots, which is why I created a subset for every plot I wanted to create.
Here is a very stripped-down version of my problem:
# create 10 categories, each repeated 10 times
cat <- rep(1:10, 10)
# create 3 x variables (100 random values each)
x1 <- rnorm(cat)
x2 <- rnorm(cat)
x3 <- rnorm(cat)
# create data frame
test_data <- data.frame(cat, x1, x2, x3)
# create data frame for plotting:
# mean of every x variable per category
cat <- 1:10
plot_data <- data.frame(
  y = cat,
  x1_mean = tapply(test_data$x1, test_data$cat, mean),
  x2_mean = tapply(test_data$x2, test_data$cat, mean),
  x3_mean = tapply(test_data$x3, test_data$cat, mean)
)
plot(plot_data$y,plot_data$x1_mean,type = "b", xlab = "x", ylab = "y", col = "blue", ylim = c((-1),1))
lines(plot_data$y,plot_data$x2_mean, type = "b",col = "green")
lines(plot_data$y,plot_data$x3_mean, type = "b",col = "red")
#plot means per category
I use the data frame plot_data to plot my line graph, and it looks like this:
Now I would like to add a boxplot of the distribution of every x variable per category (cat).
For this, I probably have to capture the information needed for the boxplot in the data frame plot_data and then use it to draw the boxplot.
Do you have any idea on how to do this?
Here's how I would approach it.
While the means of the three x variables for one category can be placed at the same x coordinate (1 to 10), the boxplots should be drawn at different x coordinates so that they do not cover one another. This could be done using a loop over the three x variables, or by transforming the data into long format with a single x variable and another variable indicating which original x variable each value came from. We will go with the latter approach.
# put all x variables in one column
test_data_long <- reshape(test_data, direction="long", varying=list(2:4),
v.names="x", timevar="xn")
# category and variant of x (xn) together for placing boxplots along the x axis
test_data_long$cat2 <- with(test_data_long, cat + (xn - 2)/5)
# reorder so that boxplots are drawn in the correct order
test_data_long <- test_data_long[order(test_data_long$cat2), ]
The first command transforms the data into long format with a single x variable and an xn variable that holds the original index of x (1 for x1, 2 for x2, 3 for x3).
Then we create the cat2 variable, which distinguishes between the xn within each cat and is used as the x coordinate for the boxplots. We want the x2 values exactly at cat (1, 2, ..., 10), the x1 values to the left (0.8, 1.8, ..., 9.8) and the x3 values to the right (1.2, 2.2, ..., 10.2), which is achieved with cat + (xn - 2) / 5. This spacing is specific to three x variables (max(xn) = 3); more generally it could be something like cat + (xn - (max(xn) + 1)/2) / (max(xn) + 2).
Finally, we reorder the data by cat2 so that the boxplots are drawn in the same order as the x positions we provide.
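To make the spacing formula concrete, here is a minimal sketch that wraps it in a small function. The name box_offsets is hypothetical and not part of the code above; it only evaluates the general expression for a given number of x variables.

```r
# Hypothetical helper (an assumption for illustration): horizontal offsets
# for k side-by-side boxplots centred on each integer category position.
box_offsets <- function(k) {
  xn <- seq_len(k)                 # 1, 2, ..., k
  (xn - (k + 1) / 2) / (k + 2)     # same expression as in the text, without cat
}

box_offsets(3)        # -0.2, 0, 0.2: the (xn - 2) / 5 case used here
box_offsets(4)        # four boxes, symmetric around 0

# x positions of the three boxes belonging to category 1
1 + box_offsets(3)    # 0.8, 1.0, 1.2
```

The offsets are symmetric around zero, so the middle of each group stays on the integer category position whatever the number of variables.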
And now the plotting itself. With different layers of information in one plot I like to suppress the less important ones with transparency.
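The code below makes colors semi-transparent by appending two hex digits (the alpha channel) to each color string. A small sketch of the idea, with one caveat: this relies on palette() returning "#RRGGBB" strings, which is the default from R 4.0.0 on; older versions return color names, for which pasting digits does not produce a valid color. adjustcolor() from grDevices is shown as an alternative that accepts both forms; it is not what the code below uses.

```r
# Default palette entries 4, 3, 2 (blue, green, red in R >= 4.0.0)
cols <- palette()[4:2]

# Variant 1: append hex alpha digits; only valid for "#RRGGBB" strings.
# "66" is 102 of 255, i.e. about 40% opaque.
cols_hex <- paste0(cols, "66")

# Variant 2: adjustcolor() also works for colour names such as "blue"
cols_adj <- adjustcolor(cols, alpha.f = 0.4)
blue_adj <- adjustcolor("blue", alpha.f = 0.4)

cols_hex
cols_adj
```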
xn_col <- palette()[4:2]
box_col <- xn_col[unique(test_data_long$xn)]
# draw boxplots in semi-transparent colors
par(mar=c(4.5, 4.5, .5, .5))
boxplot(x ~ cat2, test_data_long, at=unique(test_data_long$cat2),
boxwex=.15, xaxt='n', xlab='cat',
border=paste0(box_col, '66'), col=paste0(box_col, '11'))
axis(1, at=unique(test_data$cat))
# compute mean values
data_mean <- aggregate(. ~ cat, test_data, mean)
# draw semi-transparent points under the means to make them pop out more
sapply(data_mean[, -1], points, x=data_mean$cat,
pch=16, col='#ffffffcc', cex=2)
# draw mean values
for (i in 1:3) {
  lines(data_mean$cat, data_mean[[paste0('x', i)]], type='b',
        col=xn_col[i], lwd=1.7, cex=1.2)
}
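For comparison, the loop-based alternative mentioned at the start could look roughly like the following. This is only a sketch under assumptions: the data are regenerated here with a fixed seed so the block is self-contained, and the color names and the variables cats, x_vars and offsets are chosen for illustration.

```r
set.seed(1)  # assumption: fixed seed only to make the sketch reproducible
test_data <- data.frame(cat = rep(1:10, 10),
                        x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))

cats    <- sort(unique(test_data$cat))
x_vars  <- c("x1", "x2", "x3")
offsets <- (seq_along(x_vars) - 2) / 5      # -0.2, 0, 0.2
cols    <- c("blue", "green3", "red")

# empty canvas large enough for all boxes
plot(NA, xlim = range(cats) + c(-0.5, 0.5),
     ylim = range(unlist(test_data[x_vars])),
     xlab = "cat", ylab = "x", xaxt = "n")
axis(1, at = cats)

# one boxplot() call per x variable, each shifted by its own offset
for (i in seq_along(x_vars)) {
  boxplot(test_data[[x_vars[i]]] ~ test_data$cat,
          at = cats + offsets[i], boxwex = 0.15,
          border = cols[i], add = TRUE, axes = FALSE)
}
```

The loop avoids reshaping the data, at the cost of one boxplot() call per variable and manual bookkeeping of the offsets; the long-format version needs a single call.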