r ggplot2 boxplot violin-plot

Violin plots don't show the expected density curves when there are fewer than four unique points


Background

I am experimenting with box plots and violin plots in ggplot2, and I have found some odd behavior that occurs only when the data contain fewer than four unique values. I am not sure whether SO is the proper place for this thread; if not, please guide me to the right place.

Single data point: plot is not rendered

library(ggplot2)
df <- data.frame(state = "bedtime", value = 100)

Box plot

ggplot(aes(x = state, y = value), data = df) + geom_boxplot() + geom_point()


Violin plot

ggplot(aes(x = state, y = value), data = df) + geom_violin()

Nothing is drawn, and a warning message is issued.


Two to three unique data points: plot is sometimes rendered

When the plot is not rendered, the behavior matches the single-point case. When it is rendered, the quantile lines are inconsistent with the box plot.

df <- data.frame(state = rep("after_meal", 4), value = rep(c(178, 162), each = 2))

Box plot

ggplot(aes(x = state, y = value), data = df) + geom_boxplot() + geom_point()


Violin plot

ggplot(aes(x = state, y = value), data = df) + geom_violin(draw_quantiles = c(0.25, 0.5, 0.75))


The quantile lines in the violin plot do not line up with the quartiles in the box plot.

Questions

  1. Why isn't the violin plot shown when there's only one data point? Having read up on kernel density estimation, I expected a very wide but flat violin. Are there other limitations or constraints in geom_violin, or is this a general rule for violin plots?
  2. Why are the 25% and 75% quantiles placed differently in the box plot and the violin plot in the second case?

Solution

  • A violin plot is a kernel density estimate mirrored about the vertical axis. It differs from a box plot, which shows summary statistics (median and quartiles) computed directly from the data.

    As to your first question: with a single point there is no spread from which to estimate a density, so the estimate collapses to a spike of zero width and infinite height at that point. To see this, replace geom_violin with geom_density.

    The second issue stems from the same thing: a density estimate is continuous and is not well-defined over a very short range, so for a small number of points the box plot is the more accurate summary.
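
Both points can be checked in base R without ggplot2. The sketch below is only an illustration: it assumes that the violin's quantile lines are read off the cumulative estimated density over the plotted data range, which is my understanding of geom_violin's default behavior but is not confirmed above. The variable names (values, sample_q, density_q, bw_single) are made up for this example.

```r
# Base R only (stats package); ggplot2 is not needed for this check.
values <- rep(c(178, 162), each = 2)

# Quartiles computed directly from the data, as a box plot summarises them.
# For this sample they are 162, 170 and 178.
sample_q <- quantile(values, probs = c(0.25, 0.5, 0.75))

# Kernel density estimate restricted to the data range.
# ASSUMPTION: this mimics a violin trimmed to the range of the data.
dens <- density(values, from = min(values), to = max(values))

# Quartiles of the smoothed curve: normalise the cumulative density and
# interpolate to find where it crosses 25%, 50% and 75%.
cdf <- cumsum(dens$y) / sum(dens$y)
density_q <- approx(cdf, dens$x, xout = c(0.25, 0.5, 0.75), ties = "ordered")$y

print(sample_q)
print(density_q)

# With a single value the default bandwidth selector has no spread to work
# with and stops with an error; capture its message instead of failing.
bw_single <- tryCatch(bw.nrd0(100), error = function(e) conditionMessage(e))
print(bw_single)
```

The density-based 25% and 75% values land strictly inside the range of the data, while the sample quartiles sit exactly on the two observed values. Under the stated assumption, that is the mismatch seen between the two plots.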