
Coordinates for formula based plotting of factors in boxplot, stripchart


I am plotting continuous data (the y-variable) against several categorical variables (factors, the x-variables) using boxplot and stripchart. The default plotting functions provide a handy formula-based interface for this: I can specify the data as Response ~ Factor1 + Factor2 + ... and each combination of the levels of Factor1, Factor2, etc. gets its own position on the x-axis.

However, I am struggling to find out what these raw coordinate values are for my data, since I want to annotate some values in my plots.

Example:

data(iris)
iris[,"DummyFactor"] <- as.factor(c("First", "Second"))
boxplot(Sepal.Length ~ Species + DummyFactor, data = iris)
stripchart(Sepal.Length ~ Species + DummyFactor, data = iris, vertical = TRUE, add = TRUE, pch = 16)

# y-axis values:
ys <- iris[,"Sepal.Length"]
# x-axis:
# How to obtain the x-axis values on my current plot?

Experimentally I found out that the x-values in this example are:

xs <- apply(model.matrix(~ -1 + Species + DummyFactor, data = iris), MARGIN=1, FUN=function(x) sum(c(1,2,3,3)[as.logical(x)]))
# Annotate a few examples, e.g. 7th, 100th and 120th observation
points(x=xs[c(7,100,120)], y=ys[c(7,100,120)], pch=16, col="red", cex=2)
iris[c(7,100,120),]
#    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species DummyFactor
#7            4.6         3.4          1.4         0.3     setosa       First
#100          5.7         2.8          4.1         1.3 versicolor      Second
#120          6.0         2.2          5.0         1.5  virginica      Second

... which works, but hardly seems like the correct way to approach this. It seems that the formula implementations of boxplot and stripchart are hidden from the user.

(Figure: boxplot/stripchart example)

Is there an easy way to obtain these coordinates in a general case?


Solution

  • See the at argument in ?boxplot:
    "numeric vector giving the locations where the boxplots should be drawn, [...]; defaults to 1:n where n is the number of boxes."

    You can get the number of boxes from, e.g., the names component of the list returned by boxplot (see 'Value' in ?boxplot):

    bp <- boxplot(Sepal.Length ~ Species + DummyFactor, data = iris)
    bp
    bp$names
    

    The boxes are ordered so that the level of the first factor in your plot formula (Species) varies fastest, then the second (DummyFactor). Get the number of boxes:

    length(bp$names)
    

    Create a vector of the default x (at) coordinates:

    at <- seq_along(bp$names)
    

    The same values could be obtained from:

    at <- with(iris, seq_along(levels(interaction(Species, DummyFactor))))
    

    Create a factor from the interaction between Species and DummyFactor. It will be used to index 'at' (indexing by a factor uses its integer codes, not its labels):

    intr <- with(iris, interaction(Species, DummyFactor))
    

    Add the x coordinates to the data frame:

    iris$at <- at[intr]
    

    Add points:

    points(Sepal.Length ~ at, data = iris[c(7, 100, 120), ], pch = 16, col = "red", cex = 2)
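The steps above can be wrapped into a small helper for the general case. This is only a sketch under stated assumptions: box_positions is a hypothetical name (not a base R function), the plot is assumed to use the default at = 1:n, and boxplot is assumed to be called with its default drop = FALSE so that every combination of levels keeps its own position.

```r
# Hypothetical helper (not part of base R): map each row of `data` to the
# default x position of its box. `factors` lists the grouping columns in the
# same order as on the right-hand side of the plot formula.
box_positions <- function(data, factors) {
  # interaction() orders its levels with the first factor varying fastest,
  # which is the box order described above.
  grp <- interaction(data[factors])
  # With the default at = 1:n, the integer code of each level is its x position.
  as.integer(grp)
}

data(iris)
iris$DummyFactor <- factor(rep(c("First", "Second"), length.out = nrow(iris)))

bp <- boxplot(Sepal.Length ~ Species + DummyFactor, data = iris)
xs <- box_positions(iris, c("Species", "DummyFactor"))

# Highlight and label a few observations by row number
sel <- c(7, 100, 120)
points(xs[sel], iris$Sepal.Length[sel], pch = 16, col = "red", cex = 2)
text(xs[sel], iris$Sepal.Length[sel], labels = sel, pos = 3)
```

If the boxes are drawn at custom locations via the at argument, the same integer codes can be used to index that vector instead, e.g. my_at[xs], where my_at stands for whatever vector was passed as at.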