r ggplot2 errorbar

Selecting filtered data for whiskers on an errorbar in ggplot2?


Sample of dataset:

sample <- structure(list(NAME = c("WEST YORKSHIRE", "WEST YORKSHIRE", "WEST YORKSHIRE", 
"WEST YORKSHIRE", "WEST YORKSHIRE", "WEST YORKSHIRE", "NOTTINGHAMSHIRE", 
"NOTTINGHAMSHIRE", "NOTTINGHAMSHIRE", "NOTTINGHAMSHIRE", "NOTTINGHAMSHIRE", 
"NOTTINGHAMSHIRE"), ACH_DATE = structure(c(17410, 17410, 17410, 
17440, 17440, 17440, 17410, 17410, 17410, 17440, 17440, 17440
), class = "Date"), MEASURE = c("DIAG_RATE_65_PLUS", "DIAG_RATE_65_PLUS_LL", 
"DIAG_RATE_65_PLUS_UL", "DIAG_RATE_65_PLUS", "DIAG_RATE_65_PLUS_LL", 
"DIAG_RATE_65_PLUS_UL", "DIAG_RATE_65_PLUS", "DIAG_RATE_65_PLUS_LL", 
"DIAG_RATE_65_PLUS_UL", "DIAG_RATE_65_PLUS", "DIAG_RATE_65_PLUS_LL", 
"DIAG_RATE_65_PLUS_UL"), VALUE = c(73.6, 66.2, 79.8, 73.7, 66.3, 
80, 77, 69.1, 83.6, 77.5, 69.6, 84.2)), class = c("tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -12L))

I'm trying to visualise the error bars for the points seen here:

sample %>% filter(MEASURE == "DIAG_RATE_65_PLUS") %>% ggplot(aes(x=ACH_DATE, y=VALUE, group=ACH_DATE)) +
  geom_dotplot(binaxis = "y", stackdir = "center", dotsize=0.2)

As you can see in the df, the lower and upper limits are stored in the MEASURE variable alongside my point values of interest, in long format.

What I'm stuck on is how to filter the df further so that the lower and upper limit values can be used in the ymin and ymax arguments.

I've tried something like:

sample %>% filter(MEASURE == "DIAG_RATE_65_PLUS") %>% ggplot(aes(x=ACH_DATE, y=VALUE, group=ACH_DATE)) +
  geom_dotplot(binaxis = "y", stackdir = "center", dotsize=0.2) +
  geom_errorbar(aes(x = ACH_DATE,
                    ymin = sample %>% filter(MEASURE == "DIAG_RATE_65_PLUS_LL") %>% select(VALUE),
                    ymax = sample %>% filter(MEASURE == "DIAG_RATE_65_PLUS_UL") %>% select(VALUE)),
                data = sample %>% filter(MEASURE != "DIAG_RATE_65_PLUS"),
                colour="red")

Which throws the error: Error: Columns `ymin`, `ymax` must be 1d atomic vectors or lists. I've tried wrapping my input to the ymin and ymax arguments with as.vector, but that doesn't seem to help.


Solution

  • ggplot2, like other tidyverse packages, works with non-standard evaluation. In arguments such as ymin it expects an expression, usually the bare name of a column in the layer's data, that evaluates to a vector. What you supplied instead is a data frame with a single column: dplyr::select returns a data frame/tibble containing the given columns, hence the error about needing to supply a 1d atomic vector or list.

    sample %>% filter(MEASURE == "DIAG_RATE_65_PLUS_LL") %>% select(VALUE)
    #> # A tibble: 4 x 1
    #>   VALUE
    #>   <dbl>
    #> 1  66.2
    #> 2  66.3
    #> 3  69.1
    #> 4  69.6
    

    If you really wanted to use this method of having all your types of measures in one column and filtering for different types, dplyr::pull takes a single column name and returns the data in that column as a vector.
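
    As a minimal sketch of that difference, the snippet below compares select and pull on a small stand-in tibble. The object names measures, lower_tbl, and lower_vec are made up for illustration; the values are copied from the question's sample.

    ```r
    library(dplyr)

    # Hypothetical stand-in for the question's data: one area, one date
    measures <- tibble::tibble(
      MEASURE = c("DIAG_RATE_65_PLUS", "DIAG_RATE_65_PLUS_LL", "DIAG_RATE_65_PLUS_UL"),
      VALUE   = c(73.6, 66.2, 79.8)
    )

    # select() keeps the tibble wrapper, so the result is still a data frame
    lower_tbl <- measures %>%
      filter(MEASURE == "DIAG_RATE_65_PLUS_LL") %>%
      select(VALUE)

    # pull() extracts the column itself as a plain numeric vector
    lower_vec <- measures %>%
      filter(MEASURE == "DIAG_RATE_65_PLUS_LL") %>%
      pull(VALUE)

    is.data.frame(lower_tbl)  # TRUE
    is.numeric(lower_vec)     # TRUE
    ```

    Even with pull, a vector handed to ymin or ymax has to line up row for row with the data that layer plots, so this route depends on the filtered rows staying in the same order.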

    However, this data frame mixes several concerns that you probably ought to separate: observation values (means, medians, or whatever they are), upper confidence interval limits, and lower confidence interval limits. While the answer to ggplot2 issues is often to reshape the data into long format, here the three kinds of value play different roles in the plot, so you're better off giving each its own column. You can do this with tidyr::spread.

    library(dplyr)
    library(ggplot2)
    
    sample %>%
      tidyr::spread(key = MEASURE, value = VALUE)
    #> # A tibble: 4 x 5
    #>   NAME     ACH_DATE   DIAG_RATE_65_PL… DIAG_RATE_65_PLU… DIAG_RATE_65_PLU…
    #>   <chr>    <date>                <dbl>             <dbl>             <dbl>
    #> 1 NOTTING… 2017-09-01             77                69.1              83.6
    #> 2 NOTTING… 2017-10-01             77.5              69.6              84.2
    #> 3 WEST YO… 2017-09-01             73.6              66.2              79.8
    #> 4 WEST YO… 2017-10-01             73.7              66.3              80
    

    Then map each of those columns to the aesthetic it belongs to: the rate itself to y, and the lower and upper limits to ymin and ymax in geom_errorbar.

    sample %>%
      tidyr::spread(key = MEASURE, value = VALUE) %>%
      ggplot(aes(x = ACH_DATE, y = DIAG_RATE_65_PLUS, group = ACH_DATE)) +
        geom_dotplot(binaxis = "y") +
        geom_errorbar(aes(ymin = DIAG_RATE_65_PLUS_LL, ymax = DIAG_RATE_65_PLUS_UL))
    #> `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
    

    Created on 2018-10-01 by the reprex package (v0.2.1)
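
For readers on tidyr 1.0.0 or later, where spread is superseded, the same reshaping can be sketched with tidyr::pivot_wider. This is an illustration under assumptions rather than part of the original reprex: long_df, wide_df, and p are made-up names, the data is a one-area subset of the question's sample, and geom_point stands in for geom_dotplot to keep the example small.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for the question's data: one area, two dates
long_df <- tibble::tibble(
  NAME     = "WEST YORKSHIRE",
  ACH_DATE = rep(as.Date(c("2017-09-01", "2017-10-01")), each = 3),
  MEASURE  = rep(c("DIAG_RATE_65_PLUS",
                   "DIAG_RATE_65_PLUS_LL",
                   "DIAG_RATE_65_PLUS_UL"), times = 2),
  VALUE    = c(73.6, 66.2, 79.8, 73.7, 66.3, 80)
)

# One row per NAME and ACH_DATE, one column per MEASURE
wide_df <- long_df %>%
  pivot_wider(names_from = MEASURE, values_from = VALUE)

# The rate goes to y; the limits go to ymin and ymax of the error bar
p <- ggplot(wide_df, aes(x = ACH_DATE, y = DIAG_RATE_65_PLUS)) +
  geom_errorbar(aes(ymin = DIAG_RATE_65_PLUS_LL, ymax = DIAG_RATE_65_PLUS_UL),
                width = 5, colour = "red") +  # width is in days on a Date axis
  geom_point()

print(p)
```

With more than one NAME per date, the points and bars would sit on top of each other at the same x position, so a colour mapping or position_dodge would probably be needed as well.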