Tags: r, ggplot2, geom-bar, errorbar, aesthetics

Grouped barplot with errorbars in ggplot2


Intro:

I'm running into trouble plotting some errorbars on a grouped barplot.

I'm trying to adapt some code that was originally used for a non-grouped barplot that I used to make the following figure:

[image: non-grouped barplot with error bars]

The Problem:

Now I am trying to plot multiple air pollutants for each site.

I am melting two separate dataframes (one with mean values, one with confidence intervals) and then joining them together. I've successfully made a grouped barplot; however, the errorbars look crazy!

How can I correctly map my errorbars so they mimic the non-grouped barplot above?

Reproducible Example:

See my entire data provenance below:

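## packages used below: melt() is from reshape2, full_join() from dplyr
library(reshape2)
library(dplyr)
library(ggplot2)

## mean values generated from raw data for each pollutant by site: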
df.mean <- structure(list(id = structure(1:5, .Label = c("A", "B", "C", "D", "E"), class = "factor"), co_mean = c(0.00965315315315315, 0.201591548253404, 0.180300223214286, 0.14681828358209, 0.136609422703303), no_mean = c(2.09379071379071, 7.17386693309651, 5.11211979166667, 7.070375, 8.84492922564529), no2_mean = c(2.90698198198198, 15.3616940497336, 14.4540014880952, 17.8782126865672, 9.94047529836248), o3_mean = c(0.848970893970894, 19.6143709295441, 18.0919508928571, 19.1743544776119, 23.300829170136)), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id", "co_mean", "no_mean", "no2_mean", "o3_mean"), row.names = c(NA, -5L))

## confidence intervals generated from raw data for each pollutant by site:
df.ci <- structure(list(id = structure(1:5, .Label = c("A", "B", "C", "D", "E"), class = "factor"), co_ci = c(0.00247560132518893, 0.00347796717254879, 0.00376771895817099, 0.025603853701267, 0.00232362415184514), no_ci = c(0.955602056071903, 0.179936357209358, 0.166243603959864, 0.413094097187208, 0.20475667069271), no2_ci = c(0.975169763947207, 0.251717055459865, 0.230073674418165, 0.479358833879918, 0.148588790912564), o3_ci = c(0.22710620006376, 0.283390020715785, 0.279702181925963, 0.754017640698111, 0.376479324970397)), class = c("tbl_df", "tbl", "data.frame"), .Names = c("id", "co_ci", "no_ci", "no2_ci", "o3_ci"), row.names = c(NA, -5L))

## convert each df to long-format:
df.mean.long <- melt(df.mean)
df.ci.long <- melt(df.ci)

## join two long dfs back together for plotting:
df.long.join <- full_join(df.mean.long, df.ci.long, by="id")

## generate confidence intervals relative to each mean:
limits <- aes(ymax = value.x + value.y, ymin = value.x-value.y) ## this is likely the problem!

## create our barplot:
barplot <- ggplot(df.long.join, aes(x=id, y=value.x, fill = variable.x)) + 
    geom_bar(position="dodge", stat="identity") + 
    geom_errorbar(limits, position = "dodge", width = 0.25)

barplot

Here's the output:

Thank you in advance!


Solution

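  • Your join is adding extra rows, and therefore extra error bars. After melting, each id has four rows in each data frame (one per pollutant), so joining on id alone pairs every mean with every confidence interval: 16 rows per id instead of 4. The error bars are also not dodged by the same amount as the bars, because with position = "dodge" the dodge width is taken from each geom's own width (0.25 for the error bars versus the default 0.9 for the bars).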
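
    A quick check makes the row inflation concrete. This is only a sketch on made-up toy data (two sites, two pollutants); toy.mean.long and toy.ci.long are hypothetical stand-ins for the melted data frames, already given a common pollutant column, and base R's merge is used in place of full_join:

    ```r
    # Toy long-format data (values are made up, not the asker's data)
    toy.mean.long <- data.frame(
      id        = rep(c("A", "B"), each = 2),
      pollutant = rep(c("co", "no"), times = 2),
      value     = c(0.1, 2.0, 0.2, 7.0)
    )
    toy.ci.long <- data.frame(
      id        = rep(c("A", "B"), each = 2),
      pollutant = rep(c("co", "no"), times = 2),
      value     = c(0.01, 0.9, 0.02, 0.2)
    )

    # Joining on id alone: every mean row meets every ci row of that site
    by.id <- merge(toy.mean.long, toy.ci.long, by = "id")

    # Joining on id and pollutant: one ci per mean
    by.both <- merge(toy.mean.long, toy.ci.long, by = c("id", "pollutant"))

    nrow(by.id)    # 8 (2 sites x 2 means x 2 cis)
    nrow(by.both)  # 4 (2 sites x 2 pollutants)
    ```

    With five sites and four pollutants the same arithmetic gives 80 rows instead of 20, which is where the extra error bars come from.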

    The code below shapes the data so as to get the desired join and also uses faceting to avoid the need for a legend. You can switch the x variable and faceting variable, depending on which comparisons you want to highlight.
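
    Swapping the x variable and the faceting variable might look like the sketch below. It runs on made-up toy data (the toy data frame is hypothetical), and the use of facet_wrap with free y scales is an assumption added here so that pollutants on very different scales stay readable; it is not part of the original answer:

    ```r
    library(ggplot2)

    # Toy long-format data (made-up values): one row per site/pollutant pair
    toy <- data.frame(
      id        = rep(c("A", "B"), each = 2),
      pollutant = rep(c("no", "o3"), times = 2),
      value     = c(2.1, 0.8, 7.2, 19.6),
      ci        = c(0.9, 0.2, 0.2, 0.3)
    )

    # Sites on the x axis, one panel per pollutant
    p.swapped <- ggplot(toy, aes(x = id, y = value, fill = id)) +
      geom_col() +
      geom_errorbar(aes(ymin = value - ci, ymax = value + ci), width = 0.25) +
      facet_wrap(~ pollutant, scales = "free_y") +
      guides(fill = "none")

    p.swapped
    ```

    This layout emphasises differences between sites for a given pollutant, whereas faceting by id emphasises the pollutant mix at each site.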

    To shape the data, the goal is to join on both id and pollutant, so we need to get each data frame in long format and get common pollutant names in each data frame.

    We first put df.mean in long format using gather (a tidyr function that is essentially the equivalent of melt from the reshape2 package). separate is there to give us a column with just the pollutant abbreviation, without _mean appended. Then we get rid of the unneeded mean column that was created with separate (although we don't have to do this).

    Now we do the same thing to df.ci, but we also change the name of the value column to ci so that it will be different from the value column we created in df.mean.

    The left_join combines the two reshaped data frames into a single data frame ready for plotting.
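
    As an alternative to the gather/separate steps, the same shape can probably be reached in one reshaping call with pivot_longer, assuming tidyr 1.0.0 or later. The sketch below uses made-up toy data (toy.mean and toy.ci are hypothetical stand-ins for df.mean and df.ci) and joins the wide tables first, so the id match is one-to-one:

    ```r
    library(dplyr)
    library(tidyr)

    # Toy wide-format stand-ins (values are made up)
    toy.mean <- data.frame(id = c("A", "B"), co_mean = c(0.1, 0.2), no_mean = c(2, 7))
    toy.ci   <- data.frame(id = c("A", "B"), co_ci = c(0.01, 0.02), no_ci = c(0.9, 0.2))

    toy.long <- toy.mean %>%
      left_join(toy.ci, by = "id") %>%            # one row per id on both sides
      pivot_longer(-id,
                   names_to  = c("pollutant", ".value"),  # "co_mean" -> pollutant "co", column "mean"
                   names_sep = "_")

    toy.long  # columns: id, pollutant, mean, ci
    ```

    Here the error bars would be mapped as ymin = mean - ci and ymax = mean + ci, since the value column is named mean rather than value.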

    library(tidyverse)
    
    df.mean %>% 
      gather(key, value, -id) %>% 
      separate(key, c("pollutant", "mean")) %>%
      select(-mean) %>% 
      left_join(df.ci %>% 
                  gather(key, value, -id) %>% 
                  separate(key, c("pollutant", "ci")) %>% 
                  select(id, pollutant, ci=value),
                by = c("id", "pollutant")) %>%   # join on both keys explicitly
      ggplot(aes(x=pollutant, y=value, fill = pollutant)) + 
        geom_bar(position=position_dodge(0.95), stat="identity") + 
        geom_errorbar(aes(ymax=value + ci, ymin=value-ci), position = position_dodge(0.95), width = 0.25) +
        facet_grid(. ~ id) +
        guides(fill="none")   # fill=FALSE is deprecated in recent ggplot2 versions
    

    [image: resulting barplot with error bars, faceted by id]
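
    If a single grouped panel is preferred over facets, closer to the layout in the question, a minimal sketch is to give the bars and the error bars the same explicit dodge width. The toy data frame is hypothetical with made-up values, and 0.9 is chosen here only because it matches the default bar width:

    ```r
    library(ggplot2)

    # Toy long-format data (made-up values): one row per site/pollutant pair
    toy <- data.frame(
      id        = rep(c("A", "B"), each = 2),
      pollutant = rep(c("no", "o3"), times = 2),
      value     = c(2.1, 0.8, 7.2, 19.6),
      ci        = c(0.9, 0.2, 0.2, 0.3)
    )

    # One position object shared by both layers keeps them aligned
    dodge <- position_dodge(width = 0.9)

    p.grouped <- ggplot(toy, aes(x = id, y = value, fill = pollutant)) +
      geom_col(position = dodge) +
      geom_errorbar(aes(ymin = value - ci, ymax = value + ci),
                    position = dodge, width = 0.25)

    p.grouped
    ```

    The width argument of geom_errorbar only controls how wide the caps are drawn; the horizontal offset of each bar comes from the shared dodge width.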