Tags: r, ggplot2, scale, r-factor

Scaling issue on raincloud plot


I am trying to create a raincloud plot that shows score by sex. However, the plot appears to subgroup the points by both score and sex, rather than by sex alone. I want it to look like the example image, where Petal.Length is grouped by species and not by the length itself. My code has worked with other data sets, so I am not sure what the issue is here.

I have also checked whether the score scale is continuous or discrete, and it is continuous.

Here is the code I am using in R:

  dplyr::group_by(sex) %>%
  dplyr::mutate(
    mean = mean(score),
    se = sd(score) / sqrt(length(score)),
    sex_y = paste0(sex, "\n(", n(), ")")
  ) %>%
  ungroup() %>%
  ggplot(aes(x = NIH_score, y = sex_y)) +
  stat_slab(aes(fill = sex)) +
  geom_point(aes(color = sex),shape = 16,
             position = ggpp::position_jitternudge(height = 0.125, width = 0, 
                                             y = -0.125,
                                             nudge.from = "jittered")) +
  scale_fill_brewer(palette = "Set1", aesthetics = c("fill", "color")) +
  geom_errorbar(aes(
    xmin = mean - 1.96 * se,
    xmax = mean + 1.96 * se
  ), width = 0.2) +
  stat_summary(fun = mean, geom = "point", shape = 16, size = 3.0) +
  theme_bw(base_size = 10) +
  theme(legend.position = "top") +
  labs(title = "Raincloud plot with ggdist", x = "score")

Solution

  • It's not that your data is being grouped by x-axis value. The bandwidth of the kernel density estimator is simply too small, so each distinct score ends up with its own narrow peak.

    Let's recreate your issue with essentially the same code but some made-up data:

    library(tidyverse)
    library(ggdist)
    
    set.seed(1)
    df <- tibble(NIH_score = sample(2:8, 200, TRUE),
                 sex = sample(c("Male", "Female"), 200, TRUE),
                 score = NIH_score)
    
    df  %>%
      dplyr::group_by(sex) %>%
      dplyr::mutate(
        mean = mean(score),
        se = sd(score) / sqrt(length(score)),
        sex_y = paste0(sex, "\n(", n(), ")")
      ) %>%
      ungroup() %>%
      ggplot(aes(x = NIH_score, y = sex_y)) +
      stat_slab(aes(fill = sex), adjust = 0.1) +
      geom_point(aes(color = sex),shape = 16,
                 position = ggpp::position_jitternudge(height = 0.125, width = 0, 
                                                       y = -0.125,
                                                       nudge.from = "jittered")) +
      scale_fill_brewer(palette = "Set1", aesthetics = c("fill", "color")) +
      geom_errorbar(aes(
        xmin = mean - 1.96 * se,
        xmax = mean + 1.96 * se
      ), width = 0.2) +
      stat_summary(fun = mean, geom = "point", shape = 16, size = 3.0) +
      theme_bw(base_size = 10) +
      theme(legend.position = "top") +
      labs(title = "Raincloud plot with ggdist", x = "score")
    

    (image: the resulting plot, with each density broken into narrow spikes at the individual score values)

    But if we widen the bandwidth by setting the adjust parameter (a multiplier on the default bandwidth) to, say, 2 inside stat_slab, we get:

    (image: the same plot drawn with adjust = 2)

    It's not clear what it is about your settings or data that is giving such a narrow bandwidth (since neither is in your question), but you should be able to get the result you need by increasing adjust.
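
To see the bandwidth effect without any plotting, here is a minimal sketch that uses base R's density() as a stand-in for the estimator behind stat_slab. That substitution is an assumption: ggdist's defaults may differ, and count_peaks is a hypothetical helper written only for this illustration. The data are made up in the same way as in the example above.

```r
# Made-up integer scores, in the spirit of the example data above
set.seed(1)
score <- sample(2:8, 200, replace = TRUE)

# Count local maxima of a density estimate, ignoring numerical noise
# below 1% of the tallest point (hypothetical helper, for illustration)
count_peaks <- function(d, min_height = 0.01 * max(d$y)) {
  y <- d$y
  mid <- y[2:(length(y) - 1)]   # interior grid points
  left <- y[1:(length(y) - 2)]  # their left neighbours
  right <- y[3:length(y)]       # their right neighbours
  sum(mid > left & mid >= right & mid > min_height)
}

narrow <- density(score, adjust = 0.1)  # 0.1 x the default bandwidth
wide <- density(score, adjust = 2)      # 2 x the default bandwidth

# Bandwidths actually used, then the number of peaks in each estimate
print(c(narrow = narrow$bw, wide = wide$bw))
print(c(narrow = count_peaks(narrow), wide = count_peaks(wide)))
```

With the narrow bandwidth the estimate has one peak per distinct score value, which is the "subgrouped" look described in the question. With the wider bandwidth, neighbouring peaks merge into a smoother curve. In the plot code, the corresponding change would be stat_slab(aes(fill = sex), adjust = 2).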