Search code examples
rggplot2geom-bargeom-text

Plot labels of the mean of the dependent variable on a stacked bar plot by two categorical variables


I am using a gender-pay gap data from Glassdoor which is accessible from here.

I am trying include the labels for mean of the response variable, totalSalary, on a stacked barplot by the 5 different performance ratings.

This is the code below so far:

  geom_bar(stat = "summary", fun = "mean", width = 0.9, color = "black") +
  theme_bw() +
  labs(x = "Job Title", y = "Mean Total Salary", fill = "Gender") +
  theme(axis.title = element_text(size = 10, color = "blue"),
        axis.text = element_text(size = 8),
        legend.position = "top") +
  # geom_col() +
  # geom_text(aes(label = totalSalary), position = position_stack(vjust = 0.5), color = "white") +
  scale_fill_manual(values = c("#FF66CC", "blue")) +
  scale_y_continuous(labels = comma) +
  coord_flip() +
  facet_wrap( ~ perfEval)

This is the plot that I get.

enter image description here

What I want to show is to label the mean total salary for male and female employees and for each job title separately on the pink and blue colored bars.

Any help would be appreciated


Solution

  • In terms of comparing salaries by gender, a side-by-side comparison seems more practical (as already pointed out in the comments).

    Nevertheless - regarding the technical question of positioning the labels, here is one way of doing it. The tricky part is finding the center positions of the stacked bars.

    library(tidyverse)
    
    df <- readr::read_csv("~/data.csv")
    
    df_summary <- df %>% 
      group_by(gender, jobTitle, perfEval) %>%
      summarize(totalcomp = mean(basePay + bonus),
                totalcomp_label = paste0(round(totalcomp * 1e-3, 0), "k")) %>%
      ungroup() 
    
    df_plot <- df_summary %>% 
      left_join(
        # the messy part to find approriate label positions - there may be a solution with less pivoting steps
        df_summary %>%
          tidyr::pivot_wider(id_cols = c(jobTitle, perfEval), 
                             values_from = "totalcomp", names_from = "gender", values_fill = 0) %>%
          dplyr::mutate(labelpos_M = Male/2, labelpos_F = Male + Female/2) %>% 
          tidyr::pivot_longer(c(Female, Male), names_to = "gender") %>%
          dplyr::mutate(
            labelpos = case_when(gender == "Male" ~ labelpos_M,
                                 gender == "Female" ~ labelpos_F,
                                 TRUE ~ NA_real_)
          ) %>%
          dplyr::select(jobTitle, perfEval, gender, labelpos),
        by = c("jobTitle", "perfEval", "gender")
      ) 
    
    # A tibble: 98 x 6
    #   gender jobTitle       perfEval totalcomp totalcomp_label labelpos
    #   <chr>  <chr>             <dbl>     <dbl> <chr>              <dbl>
    # 1 Female Data Scientist        1   118479. 118k             164089.
    # 2 Female Data Scientist        2   105040. 105k             140556.
    # 3 Female Data Scientist        3   100275. 100k             149580.
    # 4 Female Data Scientist        4    87633. 88k              127996.
    # 5 Female Data Scientist        5   101449. 101k             142046.
    
    
    df_plot %>%
      ggplot() +
      geom_col(aes(y = jobTitle, x = totalcomp, fill = gender), width = 0.9, color = "black") +
      theme_bw() +
      labs(x = "Job Title", y = "Mean Total Salary", fill = "Gender") +
      theme(axis.title = element_text(size = 10, color = "blue"),
            axis.text = element_text(size = 8),
            legend.position = "top") +
      scale_fill_manual(values = c("#FF66CC", "blue")) +
      scale_x_continuous(labels = scales::comma) +
      facet_wrap( ~ perfEval) +
      # positioning the labels
      geom_text(aes(x = labelpos, y = jobTitle, label = totalcomp_label), 
                color = "white")
    
    

    enter image description here