Displaying percentage of total gender group in each subgroup with ggplot and geom_text


I've tried everywhere to find the answer to this question but I am still stuck, so here it is:

I have a data frame data_1 that contains data from an ongoing latent profile analysis. The variables of interest for this question are profiles and gender.

I would like to plot gender distribution by profile, but within each profile show what % of each gender we have compared to the entire sample of this gender. For example, if we have 10 women in total and 5 of them are in Profile 1, I want the text on top of the women bar for Profile 1 to show 50%.

Right now I am using the following code but it is giving me the percentage for the entire population, while I just want the percentage compared to the total number of women.

ggplot(data = subset(data_1, !is.na(gender)),
       aes(x = gender, fill = gender)) + geom_bar() +
  facet_grid(cols=vars(profiles)) + theme_minimal() +
  scale_fill_brewer(palette = 'Accent', name = "Gender", 
                    labels = c("Non-binary", "Man", "Woman")) +
  labs(x = "Gender", title = "Gender distribution per LPA profile") +
  geom_text(aes(y = ((..count..)/sum(..count..)), 
                label = scales::percent((..count..)/sum(..count..))), 
            stat = "count", vjust = -28)

Thanks in advance for your help!

I tried multiple alternatives including creating the variable within the dataset using summarize and mutate but with no success unfortunately.


Solution

  • As untidy as it seems, it's likely the best approach to summarise outside of the ggplot2 call, which can be done like this:

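    library(tidyverse)
    
    # Simulated stand-in for data_1 (seed fixed so the example is reproducible)
    set.seed(1)
    data1 <- tibble(gender = sample(c("male", "female"), 100, replace = TRUE),
                    profile = sample(c("profile1", "profile2"), 100, replace = TRUE))
    
    data1 |> 
      count(gender, profile) |>          # n per gender/profile combination
      group_by(gender) |> 
      mutate(perc = n / sum(n)) |>       # share of each gender's own total
      ungroup() |>
      ggplot(aes(x = gender, y = n, fill = gender)) +
      geom_col() +
      facet_grid(~profile) +
      geom_text(aes(y = n + 3, label = scales::percent(perc)))  # label just above each bar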
    

    The facet_grid is essentially grouping the dataset by profile before doing any calculations of values, so in essence it's blind to the data in the other facet. I think the only approach is thus to summarise before the call and use geom_col (which defaults to stat = "identity") to make the plots. Note that the y value for the labels is calculated from the count variable (n + 3), so R will position the text just above the top of each bar.
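
    To check that the summarising step gives the numbers asked for in the question, here is a minimal sketch on a hand-made data set. The toy data frame and the names toy and shares are made up for illustration; only count, group_by and mutate come from the answer above.

    ```r
    library(dplyr)

    # Hypothetical toy data: 10 women (5 of them in profile1) and 4 men (1 in profile1)
    toy <- data.frame(
      gender  = c(rep("woman", 10), rep("man", 4)),
      profile = c(rep("profile1", 5), rep("profile2", 5),
                  "profile1", rep("profile2", 3))
    )

    shares <- toy |>
      count(gender, profile) |>      # one row per gender/profile combination
      group_by(gender) |>
      mutate(perc = n / sum(n)) |>   # divide by the gender total, not the grand total
      ungroup()

    shares
    ```

    The woman/profile1 row should come out with perc = 0.5, matching the "10 women, 5 in Profile 1" example, and perc should sum to 1 within each gender.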

    Edit - actually no, there's a "simpler" way

    I tell a lie, you can actually do it in the ggplot2 call, but it's a little messier:

    data1 |>
      ggplot(aes(x = gender, fill = gender)) +
      geom_bar() +
      facet_grid(~ profile) +
      stat_count(aes(y = after_stat(count) + 2,
                  label = scales::percent(after_stat(count) / 
                                          tapply(after_stat(count), 
                                                 after_stat(group), 
                                                 sum)[after_stat(group)]
                     )),
                 geom = "text")
    

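    Code borrowed from here. The tapply(after_stat(count), after_stat(group), sum) part sums the counts per gender group across both facets, and indexing the result with [after_stat(group)] looks up the matching gender total for each bar. Today I learned something!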
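
    The tapply indexing can be tried on its own in base R. This is only a sketch: the count and group vectors below are made-up stand-ins for what after_stat(count) and after_stat(group) might hold for a two-facet, two-gender plot, not values taken from ggplot2.

    ```r
    # Hypothetical computed values, one entry per bar:
    # facet 1 (group 1, group 2), then facet 2 (group 1, group 2)
    count <- c(5, 1, 5, 3)
    group <- c(1, 2, 1, 2)

    # Total per group across all bars: group 1 -> 10, group 2 -> 4
    totals <- tapply(count, group, sum)

    # Look up each bar's group total, then divide.
    # Indexing by name (as.character) avoids relying on group ids being 1, 2, ...
    share <- count / as.numeric(totals[as.character(group)])

    share
    ```

    The answer's version indexes by position with [after_stat(group)], which assumes the group ids are consecutive integers starting at 1.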