I'm forgetting something very fundamental which would explain why I'm seeing very inflated y values after a log10 transformation of the y-axis.
I have the following stacked ggplot + geom_histogram.
ggTherapy <- ggplot(genderTherapyDF, aes(freq, fill=name)) +
  geom_histogram(data=genderTherapyDF, binwidth = 1, alpha=0.5, color="black") + theme_bw() +
  theme(legend.position="none", axis.title = element_text(size=14), legend.text = element_text(size=14), axis.text.y = element_text(size=12, angle=45), axis.text.x = element_text(size=12), legend.background = element_rect(fill="transparent")) +
  ylab("No. of patients") + xlab("Events") + labs(fill="") + ggtitle("Therapy")
The y-values are exactly what I expect. However, the distribution is so skewed that the plot is hard to read by eye, so I'd rather see a transformed plot.
I tried transforming x, but quickly realised that a transformation along the binned axis is very difficult to interpret. So I transformed the frequency on the y-axis instead:
ggTherapy <- ggplot(genderTherapyDF, aes(freq, fill=name)) +
  geom_histogram(data=genderTherapyDF, binwidth = 1, alpha=0.5, color="black") + theme_bw() +
  theme(legend.position="none", axis.title = element_text(size=14), legend.text = element_text(size=14), axis.text.y = element_text(size=12, angle=45), axis.text.x = element_text(size=12), legend.background = element_rect(fill="transparent")) +
  ylab("No. of patients") + xlab("Events") + labs(fill="") + ggtitle("Therapy") +
  scale_y_log10()
Visually, the plot makes sense. However, I'm struggling to come to terms with the y-axis labels! Why are they so huge after a log10 transformation?
I'm going to make a case against using a stacked position on a log-transformed y-axis.
Consider the following data.
df <- data.frame(
  x = c(1, 1),
  y = c(10, 10),
  z = c("A", "B")
)
It's just two equal observations from two groups sharing an x position. If we were to plot this in a stacked bar chart, it would look like the following:
library(ggplot2)
ggplot(df, aes(x, y, fill = z)) +
  geom_col(position = "stack")
And this does exactly what you expect it would do. However, if we now transform the y-axis, we get the following:
ggplot(df, aes(x, y, fill = z)) +
  geom_col(position = "stack") +
  scale_y_continuous(trans = "log10")
In the plot above, group B appears to have the value 10, which is correct, and group A appears to have the value 90, which is incorrect. This happens because the position adjustment (the stacking) is applied after the values have been transformed to the log scale, so instead of log10(A + B) you get log10(A) + log10(B), which is the same as log10(A * B), as the top height.
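To check the arithmetic with the toy data above, here is a small base-R sketch. The variable names (a, b, stacked_top, true_top) are mine and purely illustrative; nothing here calls ggplot2, it only reproduces the numbers you read off the axis.

```r
# Values of groups A and B from the toy data frame above
a <- 10
b <- 10

# Stacking on the log10 scale adds the logs, so the top of the stack
# sits at log10(a) + log10(b); back-transformed, that is a * b.
stacked_top <- 10^(log10(a) + log10(b))

# What a faithful stacked bar should show: the sum of the raw values
true_top <- a + b

stacked_top      # 100, i.e. a * b
true_top         # 20
stacked_top - b  # 90, the apparent height of group A on the axis
```

With counts in the hundreds per bin, as a histogram can easily have, the product grows very quickly, which would explain the huge axis labels in the question.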
Instead, I'd recommend not stacking histograms if you plan on transforming the y-axis, and using the fill's alpha to tease them apart. Example below:
df <- data.frame(
  x = c(rnorm(100, 1), rnorm(100, 2)),
  z = rep(c("A", "B"), each = 100)
)
ggplot(df, aes(x, fill = z)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  scale_y_continuous(trans = "log10")
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#> Warning: Transformation introduced infinite values in continuous y-axis
Yes, the 0s will become -Inf, but at least the y-axis is now correct.
EDIT: If you want to filter out the -Inf observations, one nice thing in version 1.1.1 of the scales package is the oob_censor_any() function, used as follows:
scale_y_continuous(trans = "log10", oob = scales::oob_censor_any)