Search code examples
rggplot2linechartconfidence-intervaltrend

Calculating length of 95%-CI using dplyr


Last time I asked how it was possible to calculate the average score per measurement occasion (week) for a variable (procras) that has been measured repeatedly for multiple respondents. So my (simplified) dataset in long format looks for example like the following (here two students, and 5 time points, no grouping variable):

studentID  week   procras
   1        0     1.4
   1        6     1.2
   1        16    1.6
   1        28    NA
   1        40    3.8
   2        0     1.4
   2        6     1.8
   2        16    2.0
   2        28    2.5
   2        40    2.8

Using dplyr I would get the average score per measurement occasion

mean_data <- group_by(DataRlong, week)%>% summarise(procras = mean(procras, na.rm = TRUE))

Looking like this e.g.:

Source: local data frame [5 x 2]
        occ  procras
      (dbl)    (dbl)
    1     0 1.993141
    2     6 2.124020
    3    16 2.251548
    4    28 2.469658
    5    40 2.617903

With ggplot2 I could now plot the average change over time, and by easily adjusting the group_data() of dplyr I could also get means per sub groups (for instance, the average score per occasion for men and women). Now I would like to add a column to the mean_data table which includes the length for the 95%-CIs around the average score per occasion.

http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/ explains how to get and plot CIs, but this approach seems to become problematic as soon as I wanted to do this for any subgroup, right? So is there a way to let dplyr also include the CI (based on group size, ect.) automatically in the mean_data? After that it should be fairly easy to plot the new values as CIs into the graphs I hope. Thank you.


Solution

  • You could do it manually using mutate a few extra functions in summarise

    library(dplyr)
    mtcars %>%
      group_by(vs) %>%
      summarise(mean.mpg = mean(mpg, na.rm = TRUE),
                sd.mpg = sd(mpg, na.rm = TRUE),
                n.mpg = n()) %>%
      mutate(se.mpg = sd.mpg / sqrt(n.mpg),
             lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
             upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)
    
    #> Source: local data frame [2 x 7]
    #> 
    #>      vs mean.mpg   sd.mpg n.mpg    se.mpg lower.ci.mpg upper.ci.mpg
    #>   (dbl)    (dbl)    (dbl) (int)     (dbl)        (dbl)        (dbl)
    #> 1     0 16.61667 3.860699    18 0.9099756     14.69679     18.53655
    #> 2     1 24.55714 5.378978    14 1.4375924     21.45141     27.66287