I am plotting continuous data (y-variable) based on several categorical variables (factors, x-variables) using boxplot and stripchart. For this purpose the default plotting functions provide a handy formula-based interface, where I can input data as: Response ~ Factor1 + Factor2 + ... and obtain combinations of Factor 1, Factor 2 etc as x-axis coordinates.
However, I am struggling to find out what these raw coordinate values are for my data, since I want to annotate some values in my plots.
Example:
data(iris)
iris[,"DummyFactor"] <- as.factor(c("First", "Second"))
boxplot(Sepal.Length ~ Species + DummyFactor, data = iris)
stripchart(Sepal.Length ~ Species + DummyFactor, data = iris, vertical = TRUE, add = TRUE, pch = 16)
# y-axis values:
ys <- iris[,"Sepal.Length"]
# x-axis:
# How to obtain the x-axis values on my current plot?
Experimentally I found out that the x-values in this example are:
xs <- apply(model.matrix(~ -1 + Species + DummyFactor, data = iris), MARGIN=1, FUN=function(x) sum(c(1,2,3,3)[as.logical(x)]))
# Annotate a few examples, e.g. 7th, 100th and 120th observation
points(x=xs[c(7,100,120)], y=ys[c(7,100,120)], pch=16, col="red", cex=2)
iris[c(7,100,120),]
#     Sepal.Length Sepal.Width Petal.Length Petal.Width    Species DummyFactor
#7             4.6         3.4          1.4         0.3     setosa       First
#100           5.7         2.8          4.1         1.3 versicolor      Second
#120           6.0         2.2          5.0         1.5  virginica      Second
... which works, but hardly seems like the correct way to approach this. The formula implementations of boxplot and stripchart seem to be hidden from the user.
Is there an easy way to obtain these coordinates in a general case?
See the 'at' argument in ?boxplot:
"numeric vector giving the locations where the boxplots should be drawn, [...]; defaults to 1:n where n is the number of boxes."
You can get the number of boxes from, e.g., the 'names' element of the list returned by boxplot (see 'Value' in ?boxplot):
bp <- boxplot(Sepal.Length ~ Species + DummyFactor, data = iris)
bp
bp$names
The boxes are ordered so that the level of the first factor in your plot formula (Species) varies fastest, then the second (DummyFactor). Get the number of boxes:
length(bp$names)
Create a vector of the default x ('at') coordinates:
at <- seq_along(bp$names)
The same values could be obtained from:
at <- with(iris, seq_along(levels(interaction(Species, DummyFactor))))
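As a quick sanity check of the claim that both routes give the same positions, the sketch below compares the box labels returned by boxplot with the levels of the interaction factor. It assumes the default settings of boxplot's formula method (no custom 'at', no dropped or reordered levels); plot = FALSE is used only so that no graphics device is opened, and the name same_order is just an illustrative variable.

```r
# Rebuild the example data from the question
data(iris)
iris$DummyFactor <- factor(rep(c("First", "Second"), length.out = nrow(iris)))

# Compute the boxplot statistics without drawing anything
bp <- boxplot(Sepal.Length ~ Species + DummyFactor, data = iris, plot = FALSE)

# Levels of the interaction factor, first factor varying fastest
grp_levels <- levels(with(iris, interaction(Species, DummyFactor)))

# Compare box labels and interaction levels, element by element and in order
same_order <- identical(bp$names, grp_levels)
same_order
```

If same_order is TRUE, the position of a box is simply the index of its label in either vector.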
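Create a factor from the interaction between Species and DummyFactor. This will be used for subsetting 'at': indexing with a factor uses its integer codes, so each observation picks up the position of its own box: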
intr <- with(iris, interaction(Species, DummyFactor))
Add the x coordinates to the data frame:
iris$at <- at[intr]
Add points:
points(Sepal.Length ~ at, data = iris[c(7, 100, 120), ], pch = 16, col = "red", cex = 2)
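For the general case, the steps above can be wrapped in a small function. This is only a minimal sketch: box_x is a hypothetical helper name, not part of base R, and it assumes the plot uses the default box positions 1:n and the default factor handling of boxplot and stripchart (no custom 'at', no dropped levels, first factor varying fastest).

```r
# Hypothetical helper (not a base R function): returns, for every row of
# 'data', the default x position of the box that the row falls into.
box_x <- function(formula, data) {
  # Keep rows with missing values so the result stays aligned with 'data'
  mf <- model.frame(formula, data = data, na.action = na.pass)
  # Column 1 is the response; the remaining columns are the grouping factors
  grp <- interaction(mf[-1], drop = FALSE)
  # The integer codes of the interaction factor are the default positions
  as.integer(grp)
}

# Example data from the question
data(iris)
iris$DummyFactor <- factor(rep(c("First", "Second"), length.out = nrow(iris)))

xs <- box_x(Sepal.Length ~ Species + DummyFactor, data = iris)
xs[c(7, 100, 120)]
```

With the plot from the question already drawn, the observations can then be annotated with points(x = xs[c(7, 100, 120)], y = iris$Sepal.Length[c(7, 100, 120)], pch = 16, col = "red", cex = 2).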