Search code examples
rdplyrsummarize

Odd behavior of summarise across in DPLYR


I have two large tables (~12k x 6) based on a survey administered to children and their parent. The tables are identical in dimensions, types/classes, and were processed into R identically. After some wrangling (again, done same for children and parents) I run the following code:

UPDATE: It turns out the source of my issue is variable C which only has values 0 and 1 in the Children data set. Is there any way to get around this error when using summarise with table?

Parents %>% 
  summarise(across(A, ~ table(.x)),
            across(B, ~table(.x)),
            across(C, ~ table(.x)),
            across(D, ~ table(.x)),
            across(E, ~ table(.x)))

Children %>%  
  summarise(across(A, ~ table(.x)),
            across(B, ~table(.x)),
            across(C, ~ table(.x)),
            across(D, ~ table(.x)),
            across(E, ~ table(.x)))

For Parents I get the following output (frequency of unique values D var (1,2,3), others (0,1,2):

        A          B      C           D      E
1   11840      11835  11409       11363    519
2      35         42    436         473   4912
3       3          1     33          42   6447

For Children I get the following error:

Error: Problem with `summarise()` input `..5`.
x Input `..5` must be size 4 or 1, not 3.
ℹ An earlier column had size 4.
ℹ Input `..5` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
Run `rlang::last_error()` to see where the error occurred.

Running rlang::last_error() returns:

<error/dplyr_error>
Problem with `summarise()` input `..5`.
x Input `..5` must be size 4 or 1, not 3.
ℹ An earlier column had size 4.
ℹ Input `..5` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
Backtrace:
Run `rlang::last_trace()` to see the full context.

Running rlang::last_trace() returns:

<error/dplyr_error>
Problem with `summarise()` input `..5`.
x Input `..5` must be size 4 or 1, not 3.
ℹ An earlier column had size 4.
ℹ Input `..5` is `(function (.cols = everything(), .fns = NULL, ..., .names = NULL) ...`.
Backtrace:
     █
  1. ├─`%>%`(...)
  2. ├─dplyr::summarise(...)
  3. ├─dplyr:::summarise.data.frame(...)
  4. │ └─dplyr:::summarise_cols(.data, ...)
  5. │   └─base::withCallingHandlers(...)
  6. ├─dplyr:::abort_glue(...)
  7. │ ├─rlang::exec(abort, class = class, !!!data)
  8. │ └─(function (message = NULL, class = NULL, ..., trace = NULL, parent = NULL, ...
  9. │   └─rlang:::signal_abort(cnd)
 10. │     └─base::signalCondition(cnd)
 11. └─(function (e) ...

Does anyone have any idea what might be happening?

For sanity sake, here are the str summaries:

> str(Parents)
'data.frame':   11878 obs. of  6 variables:
 $ ID         : chr  "Parent 1" "Parent 2" "Parent 3" "Parent 4" ...
 $ A          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ B          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ C          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ D          : num  2 2 1 2 3 3 2 3 3 2 ...
 $ E          : num  0 0 0 0 0 0 0 0 0 0 ...
> str(Children)
'data.frame':   11878 obs. of  6 variables:
 $ ID         : chr  "Child 1" "Child 2" "Child 3" "Child 4" ...
 $ A          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ B          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ C          : num  0 0 0 0 0 0 0 0 0 0 ...
 $ D          : num  2 2 1 2 3 3 2 3 3 2 ...
 $ E          : num  0 0 0 0 0 0 0 0 0 0 ...

Solution

  • table will not necessarily fit in tidyverse pipeline always since it returns unequal number of values. I think it would be better to get the data in long format and use count. You'll get the same information but in long format.

    library(dplyr)
    library(tidyr)
    
    Parents %>%  pivot_longer(cols = A:E) %>% count(name, value)
    

    The same should work for Children data.