r dplyr tidyr r-factor

Keeping factor order after gather and summarise steps in tidyverse


I have over a hundred variables for which I'm trying to calculate frequencies and percentages. How can I maintain the factor order of each variable's values in the output? Please note that specifying the order for each variable outside the dataset is not practical, as I have over 100 variables.

Example data:

df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
                 disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
df
  gender disease
1   male     yes
2 female     yes
3   male      no
4   <NA>    <NA>

Attempt:

df %>% gather(key, value, factor_key = T) %>%
  group_by(key, value) %>% 
  summarise(n=n()) %>%
  ungroup() %>%
  group_by(key) %>%
  mutate(percent=n/sum(n))

Output:

# A tibble: 6 x 4
# Groups:   key [2]
  key     value      n percent
  <fct>   <chr>  <int>   <dbl>
1 gender  female     1    0.25
2 gender  male       2    0.5 
3 gender  NA         1    0.25
4 disease no         1    0.25
5 disease yes        2    0.5 
6 disease NA         1    0.25

The desired output would order gender as male, female and disease as yes, no.


Solution

  • Update: if you use pivot_longer (the successor to gather), the value column stays a factor, so the level order is retained. You can also fine-tune the column types with the names_transform and values_transform arguments of pivot_longer.

    library(tidyverse)
    df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
                     disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
    
    df %>% 
      pivot_longer(everything()) %>%
      group_by(name, value) %>% 
      summarise(n=n(), .groups = "drop_last") %>%
      mutate(percent=n/sum(n))
    #> # A tibble: 6 x 4
    #> # Groups:   name [2]
    #>   name    value      n percent
    #>   <chr>   <fct>  <int>   <dbl>
    #> 1 disease yes        2    0.5 
    #> 2 disease no         1    0.25
    #> 3 disease <NA>       1    0.25
    #> 4 gender  male       2    0.5 
    #> 5 gender  female     1    0.25
    #> 6 gender  <NA>       1    0.25
    

    Created on 2020-10-16 by the reprex package (v0.3.0)
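
    Note that the name column above is a character vector, so the groups come out alphabetically (disease before gender). As a minimal sketch of the names_transform argument mentioned earlier, and assuming a tidyr version that has it (1.1.0 or later), the name column can be turned into a factor that follows the original column order. The helper in_order is a hypothetical name introduced here for illustration.

    ```r
    library(dplyr)
    library(tidyr)

    df <- data.frame(
      gender  = factor(c("male", "female", "male", NA), levels = c("male", "female")),
      disease = factor(c("yes", "yes", "no", NA), levels = c("yes", "no"))
    )

    # Hypothetical helper: factor whose levels follow first appearance
    in_order <- function(x) factor(x, levels = unique(x))

    res <- df %>%
      # Convert `name` to a factor so groups keep the column order of `df`
      pivot_longer(everything(), names_transform = list(name = in_order)) %>%
      group_by(name, value) %>%
      summarise(n = n(), .groups = "drop_last") %>%
      mutate(percent = n / sum(n))

    res
    ```

    With this, the gender rows should come before the disease rows, and within each variable the values keep their factor order with NA last.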


    If you are still using gather: it drops the factor class of the value column (see the warning in the output below), and summarise also appears to drop data frame attributes, so you'll have to re-add the levels. You can do this in a semi-automated way by reading the factor levels from the original data frame and combining them, like this:

    library(tidyverse)
    df <- data.frame(gender=factor(c("male", "female", "male", NA), levels=c("male", "female")),
                     disease=factor(c("yes","yes","no", NA), levels=c("yes", "no")))
    
    df %>% 
      gather(key, value, factor_key = TRUE) %>%
      group_by(key, value) %>% 
      summarise(n=n()) %>%
      ungroup() %>%
      group_by(key) %>%
      mutate(percent=n/sum(n),
             value = factor(value, levels = df %>% map(levels) %>% unlist())) %>%
      arrange(key, value)
    #> Warning: attributes are not identical across measure variables;
    #> they will be dropped
    #> `summarise()` regrouping output by 'key' (override with `.groups` argument)
    #> # A tibble: 6 x 4
    #> # Groups:   key [2]
    #>   key     value      n percent
    #>   <fct>   <fct>  <int>   <dbl>
    #> 1 gender  male       2    0.5 
    #> 2 gender  female     1    0.25
    #> 3 gender  <NA>       1    0.25
    #> 4 disease yes        2    0.5 
    #> 5 disease no         1    0.25
    #> 6 disease <NA>       1    0.25
    

    Created on 2020-10-16 by the reprex package (v0.3.0)
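
    One caveat when scaling the gather approach to 100+ variables: several columns will probably share labels such as "yes" and "no", and factor() refuses a levels vector that contains duplicates. A minimal sketch of one way around this, assuming it is acceptable to keep only the first occurrence of each label (the smoker column is a hypothetical addition for illustration):

    ```r
    df <- data.frame(
      gender  = factor(c("male", "female", "male", NA), levels = c("male", "female")),
      disease = factor(c("yes", "yes", "no", NA), levels = c("yes", "no")),
      # hypothetical third variable that reuses the "yes"/"no" labels
      smoker  = factor(c("no", "yes", "no", NA), levels = c("yes", "no"))
    )

    # Collect the levels column by column; unique() keeps the first occurrence of each
    all_levels <- unique(unlist(lapply(df, levels), use.names = FALSE))
    all_levels
    #> [1] "male"   "female" "yes"    "no"

    # Without unique(), the duplicated "yes"/"no" make factor() fail
    dup_fails <- inherits(
      try(factor("yes", levels = unlist(lapply(df, levels))), silent = TRUE),
      "try-error"
    )
    ```

    The deduplicated vector can then be passed as levels = all_levels in the mutate step. A single shared level vector cannot represent two variables that order the same labels differently (say, yes/no in one and no/yes in another); in that case the pivot_longer route, or per-variable handling, is likely the safer choice.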