I am currently working in R, attempting to create a panel of plots that each contain two overlaying histograms: a red histogram underneath a blue histogram. The red histogram contains the same data set in each plot and thus should be displayed consistently across the board. I have found that this is not so. The red histogram differs, despite the data being exactly the same in each plot. Is there a way to fix this? Am I missing something in my code that is causing this inconsistency?
Here is the code I used to create the plots:
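# Packages the code below relies on (plot_grid() is assumed to come from cowplot)
library(data.table)
library(ggplot2)
library(cowplot)
test<-rnorm(1000)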
test<-as.data.table(test)
test[, type:="Sample"]
setnames(test, old="test", new="value")
test_2<-rnorm(750)
test_2<-as.data.table(test_2)
test_2[, type:="Sub Sample"]
setnames(test_2, old="test_2", new="value")
test_2_final<-rbind(test, test_2, fill=TRUE)
test_3<-rnorm(500)
test_3<-as.data.table(test_3)
test_3[, type:="Sub Sample"]
setnames(test_3, old="test_3", new="value")
test_3_final<-rbind(test, test_3, fill=TRUE)
test_4<-rnorm(250)
test_4<-as.data.table(test_4)
test_4[, type:="Sub Sample"]
setnames(test_4, old="test_4", new="value")
test_4_final<-rbind(test, test_4, fill=TRUE)
test_5<-rnorm(100)
test_5<-as.data.table(test_5)
test_5[, type:="Sub Sample"]
setnames(test_5, old="test_5", new="value")
test_5_final<-rbind(test, test_5, fill=TRUE)
test_6<-rnorm(50)
test_6<-as.data.table(test_6)
test_6[, type:="Sub Sample"]
setnames(test_6, old="test_6", new="value")
test_6_final<-rbind(test, test_6, fill=TRUE)
draws_750_p<-ggplot(data = test_2_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
draws_500_p<-ggplot(data = test_3_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
draws_250_p<-ggplot(data = test_4_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
draws_100_p<-ggplot(data = test_5_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
draws_50_p<-ggplot(data = test_6_final, aes(x=value, fill=type, color=type)) + geom_histogram(position="identity", alpha = 0.2, bins=30) + theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
full_plot<-plot_grid(draws_750_p, draws_500_p, draws_250_p, draws_100_p, draws_50_p, ncol = 3, nrow = 2)
And here is a picture of the odd results I am describing. Notice how the distribution of the red histogram differs even though the data set is exactly the same in each plot (in this example it is most visible in the draws_250_p plot in the right-hand corner).
As I mentioned in a comment, the issue is that the bins being used are different for each plot, so the same value can end up in a different bin. The default is to guess at reasonable bin boundaries based on the number of bins specified and the range of the data, but since the sub samples are different in each plot (and may start earlier or later than the main sample), the resulting boundaries will be different.
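To make the effect concrete, here is a rough base R sketch. It is not ggplot2's exact binning algorithm, only an approximation that places 30 equal-width bins over the combined range of the data. The helper count_sample and the contrived sub samples are hypothetical and chosen so that the combined range clearly changes.

```r
set.seed(42)
sample_values <- rnorm(1000)

# Hypothetical helper: count the main sample in `bins` equal-width bins that
# span the combined range of the sample and a sub sample. This only
# approximates the default behaviour described above.
count_sample <- function(sample_values, sub_values, bins = 30) {
  rng <- range(c(sample_values, sub_values))
  breaks <- seq(rng[1], rng[2], length.out = bins + 1)
  breaks[length(breaks)] <- rng[2]  # pin the last break to guard against rounding
  bin_index <- cut(sample_values, breaks = breaks,
                   include.lowest = TRUE, labels = FALSE)
  tabulate(bin_index, nbins = bins)
}

# Same main sample, two contrived sub samples with different ranges
counts_narrow <- count_sample(sample_values, sub_values = 0)
counts_wide <- count_sample(sample_values, sub_values = c(-6, 6))

# The per-bin counts of the unchanged main sample no longer match
identical(counts_narrow, counts_wide)
```

Both vectors still add up to 1000, since no data is lost. Only the boundaries moved, and that is enough to change the shape of the histogram.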
The solution is to specify the bin boundaries directly so they are the same in every plot. Here is an example of specifying the bin boundaries implicitly using a combination of binwidth and boundary. I have also taken the liberty of combining all of the values into a single dataframe so that they can be plotted at once using facet_wrap, which has the advantage of aligning the axes of the individual facets and labelling them with the size of the subsample. The crucial point is in the call to geom_histogram, though. You can hopefully see that the red distributions are the same in each facet now.
library(tidyverse)
test <- tibble(type = "Sample", value = rnorm(1000))
# Append a sub sample of size n to the main sample and label the rows with n
add_sub_sample <- function(n, df) {
  sub_sample <- tibble(type = "Sub Sample", value = rnorm(n))
  df %>%
    rbind(sub_sample) %>%
    mutate(sub_sample_n = n)
}
# One copy of the main sample per sub sample size, stacked into one data frame
test_final <- c(750, 500, 250, 100, 50) %>%
  map(add_sub_sample, test) %>%
  bind_rows()
ggplot(test_final, aes(x = value, fill = type, colour = type)) +
  # binwidth + boundary pin the bin edges, so they are identical in every facet
  geom_histogram(position = "identity", alpha = 0.2, binwidth = 0.2, boundary = 0) +
  facet_wrap(~sub_sample_n) +
  theme(plot.title = element_text(hjust = 0.5, size=10, face="plain"))
Created on 2021-07-14 by the reprex package (v1.0.0)
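As a variation on the same idea, the bin edges can also be given explicitly through the breaks argument of geom_histogram rather than implied by binwidth and boundary. The sketch below is an illustration under assumptions, not part of the original answer: the names sample_values, plot_data and shared_breaks are made up, and the range of -5 to 5 is an arbitrary choice that should cover standard normal draws (values outside the breaks would be dropped).

```r
library(ggplot2)

set.seed(1)
sample_values <- rnorm(1000)
sizes <- c(750, 500, 250, 100, 50)

# Build one data frame holding the main sample plus a sub sample per size
panels <- lapply(sizes, function(n) {
  data.frame(
    type = rep(c("Sample", "Sub Sample"), times = c(length(sample_values), n)),
    value = c(sample_values, rnorm(n)),
    sub_sample_n = n
  )
})
plot_data <- do.call(rbind, panels)

# Assumed range: wide enough for standard normal data
shared_breaks <- seq(-5, 5, by = 0.2)

p <- ggplot(plot_data, aes(x = value, fill = type, colour = type)) +
  geom_histogram(position = "identity", alpha = 0.2, breaks = shared_breaks) +
  facet_wrap(~sub_sample_n)
p
```

Because every facet uses the same vector of breaks, the main sample is counted in exactly the same bins each time, regardless of where the sub sample happens to fall.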