Search code examples
rdictionarydplyrpurrr

How to create confidence intervals for multiple columns using purrr and dplyr packages in R?


I would love to have your inputs about how to use the purrr package for multiple columns. In specific, I want to do some basic operations to create the confidence interval (lower and upper) for each of the variables mass and birth_year by skin_color, from the starwars database.

What I have done so far is:

  1. Calculate the number of observations different to NA foreach of the columns of my interest (mass and birth_year) by skin_color.
pacman::p_load("tidyr","purrr")

data <- starwars
data_obs <- 
  data %>%
  dplyr::select(mass,birth_year,skin_color) %>%
  dplyr::group_by(skin_color)%>%
  dplyr::summarise_all(funs(sum(!is.na(.))))%>%
  dplyr::ungroup()
  1. I created a database that estimates the mean and standard deviation foreach variable of interest by skin_color.
data_stats <- 
  data %>%
  dplyr::select(mass,birth_year,skin_color)%>%
  dplyr::group_by(skin_color) %>%
  dplyr::summarise_all(., list(mean,sd)
                       , na.rm=T
                       )%>%
  dplyr::ungroup()
  1. I merged both databases and in that way I have the number of observations different from NA, the mean, and the sd foreach of the columns.
data_complete <-
  dplyr::inner_join(data_obs,data_stats, by="skin_color")

From here, it is easy to estimate the standard error foreach variable manually by:

data_complete <-
  dplyr::mutate(mass_se = mass_sd/sqrt(mass_n), 
                mass_ci_upper = mass_mean + qt(1 - (0.05 / 2), mass_n - 1)*mass_se,
                mass_ci_lower = mass_mean - qt(1 - (0.05 / 2), mass_n - 1)*mass_se)

However, since this is a lot of work for my real dataset (with more than 50 variables), I would like to use the purrr package. I have tried by doing:

list_vectors <- list(data$mass,data$birth_year)
list_ready <- map(list_vectors, 
                  ~ data %>%
                    group_by(skin_color)%>%
                    dplyr::summarise_all(funs(sum(!is.na(.))))%>%
                    dplyr::summarise_all(., list(mean,sd), na.rm=T) %>%
                    dplyr::ungroup()%>%
                    dplyr::mutate(var_se=var_sd/sqrt(var_n))) 

vector_1 <- list_ready[[1]]

But this doesn't work. Any help is really really appreciated! (I really want to use the purrr package).


Solution

  • Simplest way might be to a) put your calculation steps into a function processing a vector and returning a list of a tibble with the values you need and b) passing this into across instead (using iris as an example instead):

    library(tidyverse)
    
    mean_ci <- function(vars) {
      vars <- vars[!is.na(vars)]
      mn <- mean(vars)
      se <- sd(vars) / sqrt(length(vars))
      tibble(
        mean = mn,
        lower = mn - qt(1 - (0.05 / 2), length(vars) - 1) * se,
        upper = mn + qt(1 - (0.05 / 2), length(vars) - 1) * se
      )
    }
    
    iris |>
      group_by(Species) |>
      summarise(across(where(is.numeric), mean_ci)) |> 
      unnest(where(is_tibble), names_sep = "_")
    
    #> # A tibble: 3 × 13
    #>   Species    Sepal.Len…¹ Sepal…² Sepal…³ Sepal…⁴ Sepal…⁵ Sepal…⁶ Petal…⁷ Petal…⁸
    #>   <fct>            <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
    #> 1 setosa            5.01    4.91    5.11    3.43    3.32    3.54    1.46    1.41
    #> 2 versicolor        5.94    5.79    6.08    2.77    2.68    2.86    4.26    4.13
    #> 3 virginica         6.59    6.41    6.77    2.97    2.88    3.07    5.55    5.40
    #> # … with 4 more variables: Petal.Length_upper <dbl>, Petal.Width_mean <dbl>,
    #> #   Petal.Width_lower <dbl>, Petal.Width_upper <dbl>, and abbreviated variable
    #> #   names ¹​Sepal.Length_mean, ²​Sepal.Length_lower, ³​Sepal.Length_upper,
    #> #   ⁴​Sepal.Width_mean, ⁵​Sepal.Width_lower, ⁶​Sepal.Width_upper,
    #> #   ⁷​Petal.Length_mean, ⁸​Petal.Length_lower
    

    A more purrr-y way to do it might be to map the function to nested dataframes to create a slightly longer-form data output:

    iris |> 
      nest(data = -Species) |> 
      mutate(data = map(data, ~tibble(metric = names(.x), map_df(.x, mean_ci)))) |> 
      unnest(data) 
    
    #> # A tibble: 12 × 5
    #>    Species    metric        mean lower upper
    #>    <fct>      <chr>        <dbl> <dbl> <dbl>
    #>  1 setosa     Sepal.Length 5.01  4.91  5.11 
    #>  2 setosa     Sepal.Width  3.43  3.32  3.54 
    #>  3 setosa     Petal.Length 1.46  1.41  1.51 
    #>  4 setosa     Petal.Width  0.246 0.216 0.276
    #>  5 versicolor Sepal.Length 5.94  5.79  6.08 
    #>  6 versicolor Sepal.Width  2.77  2.68  2.86 
    #>  7 versicolor Petal.Length 4.26  4.13  4.39 
    #>  8 versicolor Petal.Width  1.33  1.27  1.38 
    #>  9 virginica  Sepal.Length 6.59  6.41  6.77 
    #> 10 virginica  Sepal.Width  2.97  2.88  3.07 
    #> 11 virginica  Petal.Length 5.55  5.40  5.71 
    #> 12 virginica  Petal.Width  2.03  1.95  2.10