r dplyr purrr

Using tidyverse to get descriptive results with nest_by and then count how many observations fall below the cutoffs


Let's say I have a dataset from a regular school in which students from different living areas are tested in math, English, and science. A student has to retake a test if their score is at least 1 SD below the mean, and fails if their score is at least 2 SD below the mean.
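
To make the rule concrete, here is a minimal sketch on a hypothetical vector of scores. The values and the names scores, m, s, and status are made up for illustration, and treating a score exactly on a cutoff as below it (<=) is an assumption:

```r
# Hypothetical scores for one test, for illustration only
scores <- c(7.5, 8.8, 9.6, 10.1, 10.4, 11.0, 12.6)

m <- mean(scores) # mean of the test
s <- sd(scores)   # sample standard deviation of the test

# Check the stricter cutoff (2 SD) first, then the 1 SD cutoff
status <- ifelse(scores <= m - 2 * s, "fail",
                 ifelse(scores <= m - s, "retest", "pass"))

# Number of students in each category
table(status)
```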

I can easily compute the means, standard deviations, and these cutoffs using nest_by() from the tidyverse. What I would also like to know is how many students scored 1 SD below and 2 SD below the mean.


However, I don't know an easy way to compute these counts from the nested results.

Please check the dataset and the code I'm using to achieve the descriptive results:

library(tidyverse)
set.seed(123)
ds <- data.frame(quest = c(2,4,6),
                 living_area = c("rural","urban","mixed"),
                 math_sum = rnorm(120, 10,1),
                 english_sum = rnorm(120, 10,1),
                 science_sum = rnorm(120, 10,1)
)

ds %>% 
  select(quest, ends_with("sum")) %>% # keep quest and the score columns
  pivot_longer(-quest) %>% # transform into long format
  nest_by(quest, name) %>% # nest: one row per quest and subject
  mutate(
    n = map_dbl(data, ~nrow(data.frame(.))), # compute sample size
    mean = map_dbl(data, ~mean(.)), # get the means
    sd = map_dbl(data, ~sd(.)), # get sd
    below = mean-sd, # cutoff: 1 SD below the mean
    failed = mean-2*sd) # cutoff: 2 SD below the mean

# Manual check for quest == 2, using the rounded 1 SD cutoff of each subject
ds %>% 
  filter(quest == 2 & english_sum <= 9.19) %>% nrow()

ds %>% 
  filter(quest == 2 & math_sum <= 9.39) %>% nrow()

ds %>% 
  filter(quest == 2 & science_sum <= 8.73) %>% nrow()

Solution

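  • We can use the data column to count how many students are below one and two SDs. Because nest_by() returns a rowwise data frame, data[[1]] is the vector of scores for the current row, and sum() over the logical comparison counts the TRUE values.

    Add these two lines to the mutate call: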

        oneSd_below = sum((mean - sd) > data[[1]]),
        twoSd_below = sum((mean - 2*sd) > data[[1]])
    
    library(tidyverse)
    
    set.seed(123)
    
    ds <- data.frame(quest = c(2,4,6),
                     living_area = c("rural","urban","mixed"),
                     math_sum = rnorm(120, 10,1),
                     english_sum = rnorm(120, 10,1),
                     science_sum = rnorm(120, 10,1)
    ) %>% as_tibble()
    
    ds %>%
      select(quest, ends_with("sum")) %>% # keep quest and the score columns
      pivot_longer(-quest) %>% # transform into long format
      nest_by(quest, name) %>% # nest: one row per quest and subject
      mutate(
        n = map_dbl(data, ~ nrow(data.frame(.))), # compute sample size
        mean = map_dbl(data, ~ mean(.)), # get the means
        sd = map_dbl(data, ~ sd(.)), # get sd
        below = mean - sd, # cutoff: 1 SD below the mean
        failed = mean - 2 * sd, # cutoff: 2 SD below the mean
        oneSd_below = sum((mean - sd) > data[[1]]),
        twoSd_below = sum((mean - 2*sd) > data[[1]])
      )
    #> # A tibble: 9 × 10
    #> # Rowwise:  quest, name
    #>   quest name         data     n  mean    sd below failed oneSd_below twoSd_below
    #>   <dbl> <chr>   <list<ti> <dbl> <dbl> <dbl> <dbl>  <dbl>       <int>       <int>
    #> 1     2 englis…  [40 × 1]    40 10.0  0.839  9.19   8.35           6           0
    #> 2     2 math_s…  [40 × 1]    40 10.2  0.805  9.39   8.59           7           0
    #> 3     2 scienc…  [40 × 1]    40  9.92 1.19   8.73   7.54           8           0
    #> 4     4 englis…  [40 × 1]    40 10.0  1.08   8.94   7.87           6           0
    #> 5     4 math_s…  [40 × 1]    40  9.90 0.870  9.03   8.16           6           0
    #> 6     4 scienc…  [40 × 1]    40  9.96 0.882  9.07   8.19           6           1
    #> 7     6 englis…  [40 × 1]    40  9.87 1.03   8.83   7.80           7           0
    #> 8     6 math_s…  [40 × 1]    40  9.95 0.992  8.96   7.96           6           1
    #> 9     6 scienc…  [40 × 1]    40 10.4  0.967  9.41   8.44           5           1
    

    Created on 2021-12-25 by the reprex package (v2.0.1)
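
As an alternative sketch, not part of the reprex above, the same kind of counts can likely be obtained without nesting by grouping the long data and summarising. It assumes the same simulated ds. The name counts is chosen here for illustration, and the strict < mirrors the comparison direction used in the two added lines:

```r
library(dplyr)
library(tidyr)

set.seed(123)

ds <- data.frame(quest = c(2, 4, 6),
                 living_area = c("rural", "urban", "mixed"),
                 math_sum = rnorm(120, 10, 1),
                 english_sum = rnorm(120, 10, 1),
                 science_sum = rnorm(120, 10, 1))

counts <- ds %>%
  select(quest, ends_with("sum")) %>% # keep quest and the score columns
  pivot_longer(-quest) %>%            # long format: columns name and value
  group_by(quest, name) %>%           # one group per quest and subject
  summarise(
    n = n(),                            # sample size
    mean = mean(value),                 # group mean
    sd = sd(value),                     # group standard deviation
    below = mean - sd,                  # cutoff: 1 SD below the mean
    failed = mean - 2 * sd,             # cutoff: 2 SD below the mean
    oneSd_below = sum(value < below),   # students under the 1 SD cutoff
    twoSd_below = sum(value < failed),  # students under the 2 SD cutoff
    .groups = "drop"
  )

counts
```

Here summarise() sees the whole value vector of each group, so no list column or data[[1]] indexing is needed. Whether this reads better than the nest_by() version is mostly a matter of taste.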