r ggplot2 histogram

How to log transform the y-axis of R geom_histogram in the right direction?


I must be missing something fundamental that would explain why I see hugely inflated y values after a log10 transformation of the y-axis.

I have the following stacked ggplot + geom_histogram.

ggTherapy <- ggplot(genderTherapyDF, aes(freq, fill=name)) +
 geom_histogram(data=genderTherapyDF, binwidth = 1, alpha=0.5, color="black") + theme_bw() +
 theme(legend.position="none", axis.title = element_text(size=14), legend.text = element_text(size=14), axis.text.y = element_text(size=12, angle=45), axis.text.x = element_text(size=12), legend.background = element_rect(fill="transparent")) +
 ylab("No. of patients") + xlab("Events") + labs(fill="") +  ggtitle("Therapy")


The y-values are exactly what I expect. However, the distribution is so skewed that the plot is hard to read by eye, so I'd rather see a transformed plot.

I tried transforming x, only to realise quickly that a transformation along the binned axis is very difficult to interpret. So I transformed the frequency on the y-axis instead:

ggTherapy <- ggplot(genderTherapyDF, aes(freq, fill=name)) +
 geom_histogram(data=genderTherapyDF, binwidth = 1, alpha=0.5, color="black") + theme_bw() +
 theme(legend.position="none", axis.title = element_text(size=14), legend.text = element_text(size=14), axis.text.y = element_text(size=12, angle=45), axis.text.x = element_text(size=12), legend.background = element_rect(fill="transparent")) +
 ylab("No. of patients") + xlab("Events") + labs(fill="") +  ggtitle("Therapy") +
scale_y_log10()


Visually, the plot makes sense. However, I'm struggling to come to terms with the y-axis labels! Why are they so huge after a log10 transformation?


Solution

  • I'm going to make a case against using a stacked position on a log transformed y axis.

    Consider the following data.

    df <- data.frame(
      x = c(1, 1),
      y = c(10, 10),
      z = c("A", "B")
    )
    

    It's just two equal observations from two groups sharing an x position. If we were to plot this in a stacked bar chart, it would look like the following:

    library(ggplot2)
    ggplot(df, aes(x, y, fill = z)) +
      geom_col(position = "stack")
    

    And this does exactly what you expect it would do. However, if we now transform the y-axis, we get the following:

    ggplot(df, aes(x, y, fill = z)) +
      geom_col(position = "stack") +
      scale_y_continuous(trans = "log10")
    

    In the plot above, group B appears to have the value 10, which is correct, while group A appears to span from 10 to 100 (a value of 90), which is incorrect. This happens because the position adjustment is applied after the scale transformation: the bars are stacked in log space, so instead of log10(A + B) you get log10(A) + log10(B), which equals log10(A * B), as the top height.
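
    To make the arithmetic concrete, here is a small base R sketch of the two-bar example. It does not call ggplot2 at all; it only reproduces the sums by hand, and the variable names are made up for illustration.

    ```r
    # Heights of the two bars from the toy data (both groups have y = 10).
    a <- 10  # group A, drawn on top of the stack
    b <- 10  # group B, drawn at the bottom

    # What you would expect the top of the stack to be on the data scale.
    expected_top <- a + b

    # Stacking after the log10 transformation adds the logs instead.
    stacked_log <- log10(a) + log10(b)

    # Reading that position back off the log10 axis gives a * b, not a + b.
    displayed_top <- 10^stacked_log

    # Apparent size of the upper segment when read from the axis labels.
    apparent_upper <- displayed_top - b

    print(c(expected = expected_top, displayed = displayed_top, upper = apparent_upper))
    ```

    The stack tops out at 100 rather than 20, and the upper segment reads as 90 even though its true value is 10.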

    Instead, I'd recommend not stacking histograms if you plan on transforming the y-axis. Use position = "identity" and the fill's alpha to tease the groups apart. Example below:

    df <- data.frame(
      x = c(rnorm(100, 1), rnorm(100, 2)),
      z = rep(c("A", "B"), each = 100)
    )
    
    ggplot(df, aes(x, fill = z)) +
      geom_histogram(position = "identity", alpha = 0.5) +
      scale_y_continuous(trans = "log10")
    #> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
    #> Warning: Transformation introduced infinite values in continuous y-axis
    

    Yes, bins with a count of 0 become -Inf on the log scale, but at least the y-axis is now correct.

    EDIT: If you want to filter out the -Inf observations, the scales package (v1.1.1) provides the oob_censor_any() function, used as follows:

    scale_y_continuous(trans = "log10", oob = scales::oob_censor_any)
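
    Put together with the unstacked histogram, a complete call might look like the sketch below. It assumes ggplot2 and scales (1.1.1 or later) are installed. The set.seed() call and the explicit bins = 30 are additions for reproducibility and to silence the stat_bin() message; they are not part of the original example.

    ```r
    library(ggplot2)

    # Simulated data as in the example above; the seed is an added assumption.
    set.seed(1)
    df <- data.frame(
      x = c(rnorm(100, 1), rnorm(100, 2)),
      z = rep(c("A", "B"), each = 100)
    )

    # Overlay the groups instead of stacking them, then log-transform the counts.
    # oob_censor_any turns out-of-bounds values, including the -Inf from empty
    # bins, into NA so they are dropped rather than drawn.
    p <- ggplot(df, aes(x, fill = z)) +
      geom_histogram(position = "identity", alpha = 0.5, bins = 30) +
      scale_y_continuous(trans = "log10", oob = scales::oob_censor_any)

    p
    ```

    Because the bars are overlaid rather than stacked, each group's bar height can be read directly off the log10 axis.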