Tags: r, ggplot2, histogram

Can anyone explain why a histogram with two conditions shows an incorrect distribution in R?


I want to create a histogram with data from two different conditions (A and B in the example below). I want to plot both distributions in the same plot using geom_histogram in R.

However, it seems that for condition A, the distribution of the whole data set is shown (instead of only A).

In the example below, three cases are shown:

  1. Plotting A and B
  2. Plotting only A
  3. Plotting only B

You will see that the distribution of A is not the same when you compare 1) and 2).

Can anyone explain why this occurs and how to fix this problem?

library(ggplot2)

set.seed(5)

# Create test data frame 
test <- data.frame(
  condition=factor(rep(c("A", "B"), each=200)),
  value =c(rnorm(200, mean=12, sd=2.5), rnorm(200, mean=13, sd=2.1))
)

# Create separate data sets
test_a <- test[test$condition == "A",]
test_b <- test[test$condition == "B",]

# 1) Plot A and B
ggplot(test, aes(x=value, fill=condition)) +
  geom_histogram(binwidth = 0.25, alpha=.5) +
  ggtitle("Test A and B")

# 2) Plot only A
ggplot(test_a, aes(x=value, fill=condition)) +
  geom_histogram(binwidth = 0.25, alpha=.5) +
  ggtitle("Test A")

# 3) Plot only B
ggplot(test_b, aes(x=value, fill=condition)) +
  geom_histogram(binwidth = 0.25, alpha=.5) +
  ggtitle("Test B")

Solution

  • An alternative for visualization, not to supplant MichaelDewar's answer:

    # 1) Plot A and B, overlaid instead of stacked
    ggab <- ggplot(test, aes(x=value, fill=condition)) +
      geom_histogram(binwidth = 0.25, alpha=.5, position = "identity") +
      ggtitle("Test A and B") +
      xlim(5, 20) +
      ylim(0, 13)
    
    # 2) Plot only A
    gga <- ggplot(test_a, aes(x=value, fill=condition)) +
      geom_histogram(binwidth = 0.25, alpha=.5) +
      ggtitle("Test A") +
      xlim(5, 20) +
      ylim(0, 13)
    
    # 3) Plot only B
    ggb <- ggplot(test_b, aes(x=value, fill=condition)) +
      geom_histogram(binwidth = 0.25, alpha=.5) +
      ggtitle("Test B") +
      xlim(5, 20) +
      ylim(0, 13)
    
    library(patchwork) # solely for a quick side-by-side-by-side presentation
    gga + ggab + ggb + plot_annotation(title = 'position = "identity"')
    

    [Figure: the three histograms side by side, with position = "identity" in the combined plot]

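    The key in this visualization is adding position = "identity" to the combined histogram, so that the bars of both conditions start at zero and overlap. The single-condition plots contain only one group, so they do not need it.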

    Alternatively, one could use position = "dodge", which places the bars of the two conditions next to each other within each bin. This is best viewed on the console; it is a bit difficult to see in this small snapshot.

    [Figure: histograms with position = "dodge"]

    For comparison, here is position = "stack", the default. The "A" bars are drawn on top of the "B" bars, so the outline of "A" follows the combined counts rather than the counts of "A" alone.

    [Figure: histograms with position = "stack"]
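
The difference between the position settings can also be checked numerically rather than by eye. The following is a minimal sketch, assuming a reasonably recent ggplot2 (where `layer_data()` is available and stacking puts the first factor level on top). It rebuilds the test data from the question and compares the bar geometry that ggplot2 computes for condition "A". The object names (`p_stack`, `p_ident`, `cmp`, and so on) are illustrative only.

```r
library(ggplot2)

set.seed(5)
test <- data.frame(
  condition = factor(rep(c("A", "B"), each = 200)),
  value = c(rnorm(200, mean = 12, sd = 2.5), rnorm(200, mean = 13, sd = 2.1))
)

# Same plot twice: default stacking vs. overlaid bars
p_stack <- ggplot(test, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = 0.5)  # position = "stack" is the default
p_ident <- ggplot(test, aes(x = value, fill = condition)) +
  geom_histogram(binwidth = 0.25, alpha = 0.5, position = "identity")

# layer_data() returns the computed bars (count, ymin, ymax, ...) for a layer
d_stack <- layer_data(p_stack)
d_ident <- layer_data(p_ident)

# Helper (hypothetical name): bars of one group, sorted by bin centre
bars_of <- function(d, g) {
  out <- d[d$group == g, ]
  out[order(out$x), ]
}

a_stack <- bars_of(d_stack, 1)  # group 1 is "A", the first factor level
b_stack <- bars_of(d_stack, 2)  # group 2 is "B"
a_ident <- bars_of(d_ident, 1)

# Side-by-side view of where the "A" bars start and end in each plot
cmp <- data.frame(
  x            = a_ident$x,
  count_A      = a_ident$count,
  count_B      = b_stack$count,
  identity_top = a_ident$ymax,
  stack_bottom = a_stack$ymin,
  stack_top    = a_stack$ymax
)
head(cmp[cmp$count_A > 0, ])
```

In the `cmp` table, `identity_top` equals `count_A`, while `stack_top` equals `count_A + count_B`. That is why the stacked plot makes "A" look like the distribution of the whole data set.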