
How to use iteration on a custom function that uses dplyr


I want to create a custom function to calculate grouped percentages in a large dataset with 100+ columns. Because there are so many columns, I want to use a loop, lapply, or something similar rather than typing out the function call 100+ times. The function I wrote works fine when I call it individually for each column, but I cannot figure out how to call it repeatedly.

Here's a simplified dataframe and function:

# load required libraries:
library(tidyverse)

df<-data.frame(sex=c('M','M','M','F','M','F','M',NA),
              school=c('A','A','A','A','B','B','B',NA),
              question1=c(NA,1,1,2,2,3,3,3),
              question2=c(2,NA,2,4,5,1,2,3))

my_function<-function(dataset,question_number){

  question_number_enquo<-enquo(question_number)

  dataset%>%
    filter(!is.na(!!question_number_enquo)&!is.na(sex))%>%
    group_by(school,sex,!!question_number_enquo)%>%
    count(!!question_number_enquo)%>%
    summarise(number=sum(n))%>%
    mutate(percent=number/sum(number)*100)%>%
    ungroup()
}

My function works when I type a column name into it:

my_function(df,question1)

# A tibble: 5 x 5
  school sex   question1 number percent
  <fct>  <fct>     <dbl>  <int>   <dbl>
1 A      F             2      1     100
2 A      M             1      2     100
3 B      F             3      1     100
4 B      M             2      1      50
5 B      M             3      1      50

Here's what I've tried in terms of iteration. I want to repeat the function for every column (except for school and sex, because those are my groups).

question_col_names<-(df%>%select(-sex,-school)%>%colnames())

Using lapply with the column names as a quosure:

question_col_names_enquo<-enquo(question_col_names)
lapply(df,my_function(df,!!question_col_names_enquo))


Error: Column `<chr>` must be length 7 (the number of rows) or one, not 2

Trying lapply with unquoted column names:

lapply(df,my_function(df,question_col_names))

Error: Column `question_col_names` is unknown

Trying lapply with quoted column names:

lapply(df,my_function(df,'question_col_names'))

Error: Column `"question_col_names"` can't be modified because it's a grouping variable

I also tried apply, and got the same types of error messages:

apply(df,1,my_function(df,!!question_col_names_enquo))
Error: Column `<chr>` must be length 7 (the number of rows) or one, not 2

apply(df,1,my_function(df,question_col_names))
Error: Column `question_col_names` is unknown

apply(df,1,my_function(df,'question_col_names'))
Error: Column `"question_col_names"` can't be modified because it's a grouping variable

I also tried different variations of a for loop:

for (i in question_col_names){
  my_function(df,i)
}

Error: Column `i` is unknown


for (i in question_col_names){
  my_function(df,'i')
}
Error: Column `"i"` can't be modified because it's a grouping variable

How can I use iteration to get my function to repeat over all my columns?

I suspect that this has to do with dplyr; I know that it acts funny in custom functions, but I can get it to work in my function, just not in the iteration. I've done a deep dive on Google and Stack Overflow but haven't found anything that answered this.

Thanks in advance!


Solution

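  • Your question_col_names are strings, but enquo() captures whatever was typed at the call site (the symbol i, or the literal string), not the column that the string names. Use sym() inside your function instead to convert the string to a symbol, unquote it with !!, and pass the column name to the function as a string: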

    library(tidyverse)
    
    df <- data.frame(
      sex = c("M", "M", "M", "F", "M", "F", "M", NA),
      school = c("A", "A", "A", "A", "B", "B", "B", NA),
      question1 = c(NA, 1, 1, 2, 2, 3, 3, 3),
      question2 = c(2, NA, 2, 4, 5, 1, 2, 3)
    )
    
    my_function <- function(dataset, question_number) {
      # question_number arrives as a string; sym() converts it to a symbol for !!
      question_number_enquo <- sym(question_number)
    
      dataset %>%
        filter(!is.na(!!question_number_enquo) & !is.na(sex)) %>%
        group_by(school, sex, !!question_number_enquo) %>%
        count(!!question_number_enquo) %>%
        summarise(number = sum(n)) %>%
        mutate(percent = number / sum(number) * 100) %>%
        ungroup()
    }
    
    my_function(df, "question1")
    #> # A tibble: 5 x 5
    #>   school sex   question1 number percent
    #>   <fct>  <fct>     <dbl>  <int>   <dbl>
    #> 1 A      F             2      1     100
    #> 2 A      M             1      2     100
    #> 3 B      F             3      1     100
    #> 4 B      M             2      1      50
    #> 5 B      M             3      1      50
    
    question_col_names <- (df %>% select(-sex, -school) %>% colnames())
    
    result <- map_df(question_col_names, ~ my_function(df, .x))
    result
    #> # A tibble: 10 x 6
    #>    school sex   question1 number percent question2
    #>    <fct>  <fct>     <dbl>  <int>   <dbl>     <dbl>
    #>  1 A      F             2      1     100        NA
    #>  2 A      M             1      2     100        NA
    #>  3 B      F             3      1     100        NA
    #>  4 B      M             2      1      50        NA
    #>  5 B      M             3      1      50        NA
    #>  6 A      F            NA      1     100         4
    #>  7 A      M            NA      2     100         2
    #>  8 B      F            NA      1     100         1
    #>  9 B      M            NA      1      50         2
    #> 10 B      M            NA      1      50         5
    

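The lapply and for attempts in the question can be made to work in the same way once the function accepts a string. What follows is only a minimal sketch, not the original answer's code: it uses a condensed variant of my_function (count() with the grouping columns passed directly), and results_list and results_loop are names made up for illustration. The key point is that lapply() needs a function as its second argument, so the call is wrapped in an anonymous function and the iteration runs over the column names rather than over df itself.

```r
library(dplyr)

df <- data.frame(
  sex = c("M", "M", "M", "F", "M", "F", "M", NA),
  school = c("A", "A", "A", "A", "B", "B", "B", NA),
  question1 = c(NA, 1, 1, 2, 2, 3, 3, 3),
  question2 = c(2, NA, 2, 4, 5, 1, 2, 3)
)

# Condensed variant of my_function (an assumption, not the answer's exact body):
# the column name arrives as a string and is converted to a symbol.
my_function <- function(dataset, question_number) {
  question_sym <- rlang::sym(question_number)
  dataset %>%
    filter(!is.na(!!question_sym) & !is.na(sex)) %>%
    count(school, sex, !!question_sym, name = "number") %>%
    group_by(school, sex) %>%
    mutate(percent = number / sum(number) * 100) %>%
    ungroup()
}

question_col_names <- setdiff(names(df), c("sex", "school"))

# lapply() calls the anonymous function once per column name.
results_list <- lapply(question_col_names, function(q) my_function(df, q))
names(results_list) <- question_col_names

# A for loop also works, but each result has to be stored (or print()ed),
# because values are not auto-printed inside a loop.
results_loop <- list()
for (q in question_col_names) {
  results_loop[[q]] <- my_function(df, q)
}
```

Either approach yields a named list with one tibble per question, which can be inspected individually (for example results_list$question1) or combined afterwards.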
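    It is probably better to convert the function's result to long format, so that the question name and its value each get a single column instead of one mostly-NA column per question: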

    my_function2 <- function(dataset, question_number) {
      question_number_enquo <- sym(question_number)
    
      res <- dataset %>%
        filter(!is.na(!!question_number_enquo) & !is.na(sex)) %>%
        group_by(school, sex, !!question_number_enquo) %>%
        count(!!question_number_enquo) %>%
        summarise(number = sum(n)) %>%
        mutate(percent = number / sum(number) * 100) %>%
        ungroup() %>% 
        gather(key = 'question', value = 'value', -school, -sex, -number, -percent)
      return(res)
    
    }
    
    result2 <- map_df(question_col_names, ~ my_function2(df, .x))
    result2
    #> # A tibble: 10 x 6
    #>    school sex   number percent question  value
    #>    <fct>  <fct>  <int>   <dbl> <chr>     <dbl>
    #>  1 A      F          1     100 question1     2
    #>  2 A      M          2     100 question1     1
    #>  3 B      F          1     100 question1     3
    #>  4 B      M          1      50 question1     2
    #>  5 B      M          1      50 question1     3
    #>  6 A      F          1     100 question2     4
    #>  7 A      M          2     100 question2     2
    #>  8 B      F          1     100 question2     1
    #>  9 B      M          1      50 question2     2
    #> 10 B      M          1      50 question2     5
    

    Created on 2019-11-25 by the reprex package (v0.3.0)
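For readers on a newer tidyverse, the same idea can probably be written without sym() and !! by using the .data pronoun, which looks a column up by a string. This is a sketch under assumptions rather than part of the original answer: it assumes dplyr 1.0 or later, question_percentages and result_long are hypothetical names, and the long format is produced directly by naming the counted column value instead of reshaping afterwards.

```r
library(dplyr)

df <- data.frame(
  sex = c("M", "M", "M", "F", "M", "F", "M", NA),
  school = c("A", "A", "A", "A", "B", "B", "B", NA),
  question1 = c(NA, 1, 1, 2, 2, 3, 3, 3),
  question2 = c(2, NA, 2, 4, 5, 1, 2, 3)
)

# question_percentages() is a hypothetical helper, not from the original answer.
question_percentages <- function(dataset, question_name) {
  dataset %>%
    # .data[[...]] selects the column whose name is stored in question_name
    filter(!is.na(.data[[question_name]]), !is.na(sex)) %>%
    # count the answers per school and sex; the answer column is named "value"
    count(school, sex, value = .data[[question_name]], name = "number") %>%
    group_by(school, sex) %>%
    mutate(percent = number / sum(number) * 100) %>%
    ungroup() %>%
    # .env$ makes explicit that question_name is the function argument
    mutate(question = .env$question_name)
}

question_col_names <- setdiff(names(df), c("sex", "school"))

# One tibble per question, stacked into a single long-format result.
result_long <- bind_rows(
  lapply(question_col_names, function(q) question_percentages(df, q))
)
result_long
```

Because every per-question tibble has the same columns, bind_rows() stacks them without introducing NA padding, which is the same benefit the long-format version above is after.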