r dataframe ggplot2 time-series errorbar

Plot time series with known error (ggplot2)

I'm working with American Community Survey (ACS) 1-year estimates for a specific location over several years. For example, I'm trying to plot how the proportion of men and women riding a bike to work changes over time. From the ACS, I get estimates and standard errors, which I can then use to calculate the upper and lower bounds of the estimates.
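As a minimal sketch of that bounds calculation: the values below are made up (not real ACS figures), the data frame name `acs` is hypothetical, and the 90% confidence level is an assumption, chosen because it is the level the ACS uses for its published margins of error.

```r
# Hypothetical estimates and standard errors (not real ACS values)
acs <- data.frame(
  Year     = c(2005, 2006),
  Estimate = c(3.0, 3.1),
  SE       = c(0.30, 0.27)
)

# 1.645 is the z-value for a 90% confidence level (assumed here)
z <- 1.645

# lower and upper bounds of each estimate
acs$Min <- acs$Estimate - z * acs$SE
acs$Max <- acs$Estimate + z * acs$SE
```

A different confidence level only changes `z` (for example 1.96 for 95%).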

So the simplified data structure in wide format is like this:

| Year | EstimateM | MaxM | MinM | EstimateF | MaxF | MinF |
|------|-----------|------|------|-----------|------|------|
| 2005 | 3.0       | 3.5  | 2.5  | 2.0       | 2.3  | 1.7  |
| 2006 | 3.1       | 3.5  | 2.6  | 2.0       | 2.3  | 1.7  |
| 2007 | 5.0       | 4.2  | 5.8  | 2.5       | 3.0  | 2.0  |
| ...  | ...       | ...  | ...  | ...       | ...  | ...  |

If I only wanted to plot the estimates, I'd melt the data with only the two Estimate variables as measure.vars:

library(reshape2)  # assumption: melt() here is reshape2's

GenderModeCombined_long <- melt(GenderModeCombined,
                                id.vars = "Year",
                                measure.vars = c("EstimateM",
                                                 "EstimateF"))

The long data can then be easily plotted with ggplot2:

library(ggplot2)

# the column is named Year (capitalised), so x = year would fail
ggplot(data = GenderModeCombined_long,
       aes(x = Year, y = value, colour = variable)) +
  geom_point() +
  geom_line()

This produces a graph like so:

[Image hosted on Imgur]

(Sorry, I don't have enough rep to post images.)

Where I'm stuck is how to add error bars to the two estimate graphs. I could add them as measure vars to the melted dataset, but then how do I tell ggplot what should be plotted as values and what as error bars? Do I have to create a separate data frame with just the min/max data and then load that separately?

geom_errorbar(data = errordataMmax, aes(ymax = ??, ymin = ??))

I have the feeling that I'm somehow approaching this the wrong way and/or have my data set up the wrong way.

Solution

Welcome to SO. The problem here is that you have three "explicit" variables (Estimate, Min and Max) and an "implicit" one (gender) which is coded in column names. A way to solve this is to make "gender" an explicit grouping variable. After you go to long format, create a "gender" variable, remove the indication of gender from the key column (variable) and then go back to wide format. Something like this would work:

library(ggplot2)
library(dplyr)
library(tidyr)
library(tibble)

GenderModeCombined <- tibble::tribble(
  ~Year,   ~EstimateM,   ~MaxM,   ~MinM,   ~EstimateF,   ~MaxF,   ~MinF,  
  2005,         3.0,    3.5,    2.5,         2.0,    2.3,    1.7,  
  2006,         3.1,    3.5,    2.6,         2.0,    2.3,    1.7,  
  2007,         5.0,    4.2,    5.8,         2.5,    3.0,    2.0
)

GenderModeCombined.long <- GenderModeCombined %>% 
  # switch to long format
  tidyr::gather(variable, value, -Year,  factor_key = TRUE) %>% 
  # add a gender variable
  dplyr::mutate(gender   = stringr::str_sub(variable, -1)) %>% 
  # remove gender indication from the key column `variable`
  dplyr::mutate(variable = stringr::str_sub(variable, end = -2)) %>%
  # back to wide format
  tidyr::spread(variable, value)

GenderModeCombined.long
#> # A tibble: 6 x 5
#>    Year gender Estimate   Max   Min
#>   <dbl> <chr>     <dbl> <dbl> <dbl>
#> 1  2005 F           2     2.3   1.7
#> 2  2005 M           3     3.5   2.5
#> 3  2006 F           2     2.3   1.7
#> 4  2006 M           3.1   3.5   2.6
#> 5  2007 F           2.5   3     2  
#> 6  2007 M           5     4.2   5.8

ggplot(data=GenderModeCombined.long,
       aes(x=Year, y=Estimate,colour = gender)) +
  geom_point() +
  geom_line() + 
  geom_errorbar(aes(ymax = Max, ymin = Min))

Created on 2018-12-29 by the reprex package (v0.2.1)
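On tidyr 1.0.0 or later, the gather/mutate/spread round trip above can probably be collapsed into a single `pivot_longer()` call. This is a sketch rather than part of the original reprex: the name `GenderModeCombined_tidy` and the regular expression are my own choices, and the pattern assumes every column other than `Year` ends in a single `M` or `F`.

```r
library(tidyr)
library(tibble)

# same sample data as in the reprex above
GenderModeCombined <- tribble(
  ~Year, ~EstimateM, ~MaxM, ~MinM, ~EstimateF, ~MaxF, ~MinF,
  2005,  3.0,        3.5,   2.5,   2.0,        2.3,   1.7,
  2006,  3.1,        3.5,   2.6,   2.0,        2.3,   1.7,
  2007,  5.0,        4.2,   5.8,   2.5,        3.0,   2.0
)

GenderModeCombined_tidy <- pivot_longer(
  GenderModeCombined,
  cols = -Year,
  # split e.g. "EstimateM" into "Estimate" and "M";
  # ".value" turns the first part into output column names
  names_to = c(".value", "gender"),
  names_pattern = "^(.*)([MF])$"
)
```

The result has one row per year and gender with `Estimate`, `Max` and `Min` columns, so it should work with the same `ggplot()` call as `GenderModeCombined.long`.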