Search code examples
rdplyrsummarizeacross

Error with Custom Function and dplyr across()


I have a dataframe like this:

dat <- data.frame(ID=c("12345", "54321", "12345", "54321", "99999", "12345"),
                  cat1 = c(1,0,0,0,0,0),
                  cat2= c(1,1,1,1,1,1),
                  cat3= c(1,0,1,0,1,0),
                  cat4=c(0,0,0,0,0,0))

And I would like to spit out a table like this:

cat1 1
cat2 3
cat3 2
cat4 0

That counts how many unique ID's there are when the value of each column is 1. This is my code for doing this for a single column:

dat |>
  filter(cat1==1) |>
  summarise(n=n_distinct(ID))

And this works fine. But I want to be able to do this across all needed columns at once. I'm attempting to write a function and then use summarize(across()) like this:

my_function <- function(data, column) {
  data |>
    filter({{column}}==1) |>
    summarise(n=n_distinct(ID))
}

And then apply this function to all needed columns like this:

dat |>
  summarise(across(c(cat1, cat2, cat3, cat4), ~my_function(.x)))

I keep getting this error though:

Error in summarise(): ℹ In argument: across(c(cat1, cat2, cat3, cat4), ~my_function(.x)). Caused by error in across(): ! Can't compute column cat1. Caused by error in UseMethod(): ! no applicable method for 'filter' applied to an object of class "c('double', 'numeric')"

What am I doing wrong?

Thank you!


Solution

  • Expanding on my comment and providing a tidy solution...

    You already have several solutions that do what you want. The answer to your actual question (What am I doing wrong?) is that the function passed to across should take a single argument, which represents the column being manipulated. Your function has two parameters, the first of which is assumed to be the dataframe in which the column exists. Hence the error.

    This means that your current approach won't work in this case. Your summary function needs access to both the column name and its values. That's why your data frame isn't tidy. There are other ways to get what you want using summarise and across, but they won't be pretty because the code would be fighting against the fundamental assumption of the tidyverse - that it is working with tidy data.

    You can learn more about tidy data here.

    Here is a tidy approach to your problem.

    First, tidy your data:

    long_dat <- dat %>% 
      pivot_longer(
        starts_with("cat"),
        values_to = "value",
        names_to = "column"
      ) 
    long_dat
    # A tibble: 24 × 3
       ID    column value
       <chr> <chr>  <dbl>
     1 12345 cat1       1
     2 12345 cat2       1
     3 12345 cat3       1
     4 12345 cat4       0
     5 54321 cat1       0
     6 54321 cat2       1
     7 54321 cat3       0
     8 54321 cat4       0
     9 12345 cat1       0
    10 12345 cat2       1
    

    Now, summarise the observations that meet your criterion:

    long_dat %>% 
      filter(value == 1) %>% 
      group_by(column) %>% 
      summarise(n = length(unique(ID)))
    # A tibble: 3 × 2
      column     n
      <chr>  <int>
    1 cat1       1
    2 cat2       3
    3 cat3       2
    

    To obtain zero counts for categories that have no rows that meet your criterion, a little more work is required.

    long_dat %>% 
      filter(value == 1) %>% 
      group_by(column) %>% 
      summarise(n = length(unique(ID))) %>% 
      right_join(long_dat %>% distinct(column), by = "column") %>% 
      replace_na(list(n = 0))
    # A tibble: 4 × 2
      column     n
      <chr>  <int>
    1 cat1       1
    2 cat2       3
    3 cat3       2
    4 cat4       0