
Variable distance between geom_boxplot/geom_violin based on x-axis values


Problem

I have an x variable, for instance size, that is meaningful, so that something can be two times the size of something else. I want to relate x to y (some other variable). At the same time, due to sampling, x does not vary continuously but is discrete: there are just a few different object types, and all objects of the same type have the same size (e.g. a size of 1, 3 or 10). I want to use geoms like geom_boxplot or geom_violin to display the relationship between x and y.

However, if I keep x numeric, I get only one boxplot/violin. If I convert it to a factor (shown below), the distance between the geoms does not reflect the distance in x: the distance between 1 and 3 is the same as the distance between 3 and 10.

Is there a way to discretise the data but change the spacing so it reflects the actual difference on the x-axis and use those geoms?

Code

# Load ggplot2 for plotting
library(ggplot2)

# Seed for reproducibility
set.seed(20230518)

# Create random data
n  <- 10
df <- data.frame(x = factor(rep(c(1, 3, 10), each = n)),
                 y = c(rnorm(n), rnorm(n), rnorm(n)))


# Box plot version
ggplot(df, aes(x = x, y = y)) + geom_boxplot() + geom_point()

# Violin plot version
ggplot(df, aes(x = x, y = y)) + geom_violin() + geom_point()

[Figures: the box plots and the violin plots produced by the code above]

Expected solution

The distance between the geoms should reflect the difference in x. This should be similar to the plot below, just with geom_boxplot and geom_violin in addition to the points:

# Numeric x-axis
ggplot(df, aes(x = as.numeric(as.character(x)), y = y)) + geom_point()



Solution

  • Add the missing factor levels, then set drop = FALSE in scale_x_discrete(). This works for geom_violin, too.

    df$x <- factor(df$x, levels = 1:10)
    
    ggplot(df, aes(x = x, y = y)) + 
      geom_boxplot() + 
      geom_point() +
      scale_x_discrete(drop = FALSE)
    


    Add breaks to hide the axis labels of the other x values:

    scale_x_discrete(drop = FALSE, breaks = unique(sort(df$x)))
    

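A possible alternative, not part of the answer above, is to keep x numeric and tell ggplot2 explicitly which rows belong together via the group aesthetic. The sketch below assumes a fresh data frame (the hypothetical name df_num) in which x has not been converted to a factor; the width value is an arbitrary choice for illustration.

```r
library(ggplot2)

set.seed(20230518)

# Same structure as the question's data, but x stays numeric
n <- 10
df_num <- data.frame(x = rep(c(1, 3, 10), each = n),
                     y = rnorm(3 * n))

# group = x yields one box per distinct x value on a continuous axis,
# so the gaps between boxes follow the actual differences in x
p <- ggplot(df_num, aes(x = x, y = y, group = x)) +
  geom_boxplot(width = 0.8) +   # width is in x-axis units (assumed value)
  geom_point() +
  scale_x_continuous(breaks = unique(df_num$x))

print(p)
```

Swapping geom_boxplot() for geom_violin() should work the same way. Compared with padding the factor levels, this approach also handles non-integer or unevenly spaced x values, since positions come straight from the numeric scale.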