Tags: r, ggplot2, dplyr, tidyverse, histogram

ggplot stacked percentage histogram


I have data like this:

class  subclass  percent
A      X         7.75
A      Y         7.75
B      Z         1.25
B      Z         1.25
B      T         1.25

I want to plot a histogram with classes on the x-axis and percents on the y-axis, and with bars filled according to subclass. For the example data, the plot should have 2 bars, A and B, with heights 7.75 for A and 1.25 for B. The A bar should be split 50/50 between X and Y, and the B bar should be split between Z and T (66% Z and 33% T).

I tried using ggplot and geom_histogram:

data %>%
  ggplot(aes(x=reorder(class,-percent),
             y = percent,
             fill = subclass)) +
  geom_histogram(stat='identity') + 
  scale_y_continuous(labels = scales::percent)

This code sums the percent values within each class, so instead of plotting 7.75 for A and 1.25 for B, it plots 15.5 for A and 3.75 for B. Since the totals are wrong, I don't know whether the fill = subclass part is working. What am I doing wrong?

Thank you!!


Solution

  • First, what you want is a bar chart, so use geom_col instead of geom_histogram. Second, as your percent column holds the total percent per class, repeated on every row, you have to divide it by the number of observations per class so that the stacked bars add up to that total. Third, I added a summarise step to compute the percent per class and subclass:

    data <- structure(list(class = c("A", "A", "B", "B", "B"), subclass = c(
      "X",
      "Y", "Z", "Z", "T"
    ), percent = c(7.75, 7.75, 1.25, 1.25, 1.25)), class = "data.frame", row.names = c(NA, -5L))
    
    library(ggplot2)
    library(dplyr, warn.conflicts = FALSE)
    
    data <- data %>%
      group_by(class) %>%
      mutate(percent = percent / n()) %>%
      group_by(class, subclass) %>%
      summarise(percent = sum(percent))
    #> `summarise()` has grouped output by 'class'. You can override using the
    #> `.groups` argument.
    
    ggplot(data, aes(
      x = reorder(class, -percent),
      y = percent,
      fill = subclass
    )) +
      geom_col() +
      scale_y_continuous(labels = scales::label_percent(scale = 1))
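
As a quick sanity check on the reasoning above, the stacked segments of each class should add back up to the original per-class value (7.75 for A, 1.25 for B). The sketch below rebuilds the example data and repeats the same dplyr steps; the names `raw`, `stacked` and `totals` are illustrative and not part of the original answer, and `.groups = "drop"` is used here only to silence the grouping message.

```r
library(dplyr, warn.conflicts = FALSE)

# Same example data as in the question
raw <- data.frame(
  class    = c("A", "A", "B", "B", "B"),
  subclass = c("X", "Y", "Z", "Z", "T"),
  percent  = c(7.75, 7.75, 1.25, 1.25, 1.25)
)

# Split each class total evenly over its rows, then add up per subclass
stacked <- raw %>%
  group_by(class) %>%
  mutate(percent = percent / n()) %>%
  group_by(class, subclass) %>%
  summarise(percent = sum(percent), .groups = "drop")

# Height of each stacked bar: the sum of its segments
totals <- stacked %>%
  group_by(class) %>%
  summarise(total = sum(percent), .groups = "drop")

print(totals)
```

If the totals match the percent column of the input, geom_col will stack each bar to the intended height.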
    

    EDIT: To label each segment with the relative frequency of its subclass within the class, I would add another column to the dataset, which can then be drawn as labels via geom_text:

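    # Start again from the original data, not the summarised data from above
    data <- data %>%
      group_by(class) %>%
      mutate(percent = percent / n()) %>%
      group_by(class, subclass) %>%
      summarise(percent = sum(percent)) %>%
      # The result is still grouped by class, so this is the share within each class
      mutate(label = percent / sum(percent))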
    
    ggplot(data, aes(
      x = reorder(class, -percent),
      y = percent,
      fill = subclass
    )) +
      geom_col() +
      geom_text(aes(label = scales::percent(label)), position = position_stack(vjust = .5)) +
      scale_y_continuous(labels = scales::label_percent(scale = 1))
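
The label column only works because summarise() drops just the last grouping level, leaving the result grouped by class. A minimal sketch that spells this out with `.groups = "drop_last"` and prints the resulting shares might look like the following; `raw` and `labelled` are illustrative names, and the explicit `.groups` and `ungroup()` calls are additions for clarity rather than part of the original answer.

```r
library(dplyr, warn.conflicts = FALSE)

# Same example data as in the question
raw <- data.frame(
  class    = c("A", "A", "B", "B", "B"),
  subclass = c("X", "Y", "Z", "Z", "T"),
  percent  = c(7.75, 7.75, 1.25, 1.25, 1.25)
)

labelled <- raw %>%
  group_by(class) %>%
  mutate(percent = percent / n()) %>%
  group_by(class, subclass) %>%
  # "drop_last" keeps the result grouped by class
  summarise(percent = sum(percent), .groups = "drop_last") %>%
  # sum(percent) is therefore the class total, so label is the within-class share
  mutate(label = percent / sum(percent)) %>%
  ungroup()

print(labelled)
```

Within A the two subclasses each get a share of one half; within B, T gets one third and Z two thirds, which is what the question asked for.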
    
