
axis.break and ggplot2 or gap.plot? plot may be too complex


I created a plot with ggplot2. It's about milk protein content. I have two groups and four treatments. I want to show the interaction between group and treatment, with means and error bars. The protein content starts at 2.6%. Right now my y-axis starts there without a gap, but my supervisor wants one. I tried axis.break() from the plotrix library, but nothing happened. I also tried to rebuild the graphic with gap.plot, without success. I must admit that I'm no R-hero.

Here's the code for my graphic:

Protein<-ggplot(data=D, aes(x=treat, y=Prot,group=group, shape=group))+
  geom_line(aes(linetype=group), size=1, position=position_dodge(0.2))+
  geom_point(size=3, position=position_dodge(0.2))+
  geom_errorbar(aes(ymin=Prot-Prot_SD,ymax=Prot+Prot_SD), width=.2,
                position=position_dodge(0.2))+
  scale_shape_discrete(name='group\n', labels=c('1\n(n = 22,19,16,20)\n','2\n(n = 15,12,14,12)'))+
  scale_linetype_discrete(name="group\n", labels=c('control\n(n = 22,19,16,20)\n','free-contact\n(n = 15,12,14,12)'))+
  scale_x_discrete(labels=c('0', '1', '2', '3'))+
  labs(x='\ntreatment', y='protein content (%)\n')
ProtStar<-Protein+annotate("text", x=c(1,2,3,4), y=c(3.25,3.25,3.25,3.25),
                           label=c("Aa","Aa","Ab","Ba"), size=4)
plot(ProtStar)

Unfortunately I do not have enough reputation to post images, but you might see from the code that the graphic is complex.

It would be fantastic if you had any useful suggestions. Thanks a lot!


Solution

  • TL;DR: Look at the bottom.

    Consider these figures:

    ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() + 
      theme_classic()
    


    This is your basic plot. Now you have to consider the Y-axis.


    ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() + 
      theme_classic() +
      scale_y_continuous(limits = c(0,NA), expand = c(0,0))
    


    This is the least misleading way of emphasizing that there is a zero floor to the data, even if there are no actual points below a certain value. Percent milk protein is a good example: negative values are impossible, and you may want to emphasize that even though no observations were near zero.

    This also compresses the part of the Y axis that the data actually occupy, so that differences between the observations look smaller. If this is something you want to emphasize, that can be good. But if the natural range of some data is narrow, including the zero (and the resulting empty space) is misleading. For example, if milk protein is always between 2.6% and 2.7%, then the zero value is not a true floor for the data, but just as impossible as -50%.
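
    To put a rough number on that compression, here is a minimal sketch in base R. The helper name axis_share is made up for this illustration. It computes the fraction of the Y axis that the data occupy when the axis starts at zero, once for the iris example and once for the 2.6% to 2.7% milk protein case mentioned above.

    ```r
    # Hypothetical helper: share of the axis length covered by the data
    # when the axis runs from axis_min (default 0) up to the data maximum.
    axis_share <- function(y_min, y_max, axis_min = 0) {
      (y_max - y_min) / (y_max - axis_min)
    }

    # iris example used in this answer: data span 4.3 to 7.9
    share_iris <- axis_share(min(iris$Sepal.Length), max(iris$Sepal.Length))

    # Narrow-range example from the text: 2.6% to 2.7% protein
    share_milk <- axis_share(2.6, 2.7)

    round(c(iris = share_iris, milk = share_milk), 2)
    ```

    With a zero-based axis the iris data fill a bit less than half of the axis, while the narrow milk example would fill only a few percent of it.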


    ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() + 
      theme_classic() +
      scale_y_continuous(limits = c(0,NA), expand = c(0,0)) + 
      theme(axis.line.y = element_blank()) +
      annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf) 
    


    There are many reasons not to include a broken Y axis. It's perceived by many as being unethical or misleading to include one inside ranges of data. But this particular case is at the outer limit, beyond the actual data. I think the rules can be bent a bit for that.

    The first step is to remove the automatic Y axis line and draw it in "by hand" using annotate. Notice that the figure looks identical to the previous one. If your theme of choice uses a lot of different line sizes, you're gonna have a bad time matching them.


    ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() + 
      theme_classic() + 
      scale_y_continuous(limits = c(3.5,NA), expand = c(0,0), 
                         breaks = c(3.5, 4:7)) + 
      theme(axis.line.y = element_blank()) +
      annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf)
    


    Now you can consider where the actual data begin and where a good spot for the break is. You have to check by hand, e.g. with min(iris$Sepal.Length), and consider where the tick marks will go. This is a personal judgment call.

    I found that the lowest value was at 4.3. I knew I wanted the break to be below the minimum, and I wanted the break to be about 0.5 units long. So I chose to put a tick mark at 3.5, and then each integer afterwards with breaks = c(3.5, 4:7).
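
    The same choice can be sketched in base R. The variable names and the 0.5-unit gap length are just the choices described above, not anything ggplot2 requires. The idea is to take the first whole-number tick at or below the minimum and place the fake zero half a unit below it.

    ```r
    y <- iris$Sepal.Length

    y_min      <- min(y)             # lowest observed value
    first_tick <- floor(y_min)       # first real tick at or below the data
    gap_length <- 0.5                # chosen length of the axis break
    fake_zero  <- first_tick - gap_length

    # Tick positions and labels to hand to scale_y_continuous():
    # the fake zero first, then every integer up to the last one inside the data
    y_breaks <- c(fake_zero, first_tick:floor(max(y)))
    y_labels <- c(0, first_tick:floor(max(y)))
    ```

    For iris this reproduces breaks = c(3.5, 4:7) and labels = c(0, 4:7). For other data you would still want to look at the result and adjust by eye.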


    ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() + 
      theme_classic() + 
      scale_y_continuous(limits = c(3.5,NA), expand = c(0,0), 
                         breaks = c(3.5, 4:7), labels = c(0, 4:7)) + 
      theme(axis.line.y = element_blank()) +
      annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf)
    


    Now we need to relabel the 3.5 tick to be a fake zero with labels = c(0, 4:7).


    ggplot(iris, aes(Species, Sepal.Length)) + geom_boxplot() + 
      theme_classic() + 
      scale_y_continuous(limits = c(3.5,NA), expand = c(0,0), 
                         breaks = c(3.5, 4:7), labels = c(0, 4:7)) + 
      theme(axis.line.y = element_blank()) +
      annotate(geom = "segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf) +
      annotate(geom = "segment", x = -Inf, xend = -Inf, y =  3.5, yend = 4,
               linetype = "dashed", color = "white")
    


    Now we draw a white dashed line over the manually drawn axis line, going from our fake zero (y=3.5) to the lowest true tick mark (y=4).

    Consider that the grammar of graphics is a mature philosophy; that is to say, each element has thoughtful reasoning behind it. The fact that this is finicky to do is for good reasons, and you need to consider whether your own reasons are sufficient weight on the other side.
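
    If you decide to go ahead anyway, here is a rough sketch of the same trick applied to a plot shaped like the one in the question. The data frame D below is an invented stand-in (the real data were not posted), and the fake zero at 2.5, the tick positions and the upper limit of 3.3 are assumptions you would adjust to your own values. The legend scales and the text annotation from the question are left out to keep it short.

    ```r
    library(ggplot2)

    # Invented stand-in for the question's data frame D; values are NOT real
    D <- data.frame(
      treat   = factor(rep(0:3, times = 2)),
      group   = factor(rep(c("control", "free-contact"), each = 4)),
      Prot    = c(3.00, 3.02, 2.95, 3.05, 2.98, 3.00, 3.08, 2.96),
      Prot_SD = c(0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.10, 0.12)
    )

    dodge <- position_dodge(0.2)

    ProtGap <- ggplot(D, aes(x = treat, y = Prot, group = group, shape = group)) +
      geom_line(aes(linetype = group), position = dodge) +
      geom_point(size = 3, position = dodge) +
      geom_errorbar(aes(ymin = Prot - Prot_SD, ymax = Prot + Prot_SD),
                    width = 0.2, position = dodge) +
      theme_classic() +
      # Fake zero at 2.5 (assumed), first real tick at 2.6
      scale_y_continuous(limits = c(2.5, 3.3), expand = c(0, 0),
                         breaks = c(2.5, 2.6, 2.8, 3.0, 3.2),
                         labels = c("0", "2.6", "2.8", "3.0", "3.2")) +
      # Replace the automatic axis line with a hand-drawn one
      theme(axis.line.y = element_blank()) +
      annotate("segment", x = -Inf, xend = -Inf, y = -Inf, yend = Inf) +
      # White dashed overlay between the fake zero and the first real tick
      annotate("segment", x = -Inf, xend = -Inf, y = 2.5, yend = 2.6,
               linetype = "dashed", color = "white") +
      labs(x = "treatment", y = "protein content (%)")

    ProtGap
    ```

    The limits must contain every error bar, otherwise ggplot2 drops the out-of-range ones with a warning, so check Prot - Prot_SD and Prot + Prot_SD against the limits you pick.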