Search code examples
rdplyrsummaryacross

Summarise across list columns, get first value, in dplyr


I have a large dataframe with some semi-duplicate entries I'm trying to merge via dplyr::summarise. This works fine for numeric, character, & logical columns, but not list columns.

mydata <- tibble(A = c(1,1,2,2,3,3),
                 B = c(1,NA,4,5,7,7),
                 C = list(1:3, 1:3, 2:4, 2:4, 3:6, 3:6))

dedupedData <- mydata %>%
  group_by(A) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE))

Works as expected;

dedupedData <- mydata %>%
  group_by(A) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE),
            across(where(is.list), first))

Fails differently to my dataset. The above creates a duplicate number of rows by spreading out the ranges of C, i.e. group A1 gets 1:3, A2 gets 2:4, A3 gets 3:6. For my data:

dedupedData <- mydata %>%
  group_by(A, B) %>%
  summarise(across(where(is.numeric), mean, na.rm = TRUE),
            across(where(is.character), first),
            across(where(is.logical), first),
            across(where(is.POSIXct), first),
            across(where(is.list), first))

Works until I include the is.list line, whereupon it breaks with:

Error: Problem with summarise() input ..5. x 'names' attribute [11] must be the same length as the vector [9]

Does anyone know how to solve this? When grouped one would expect the list items to be split up as the other columns are, so the C values for group A1 should be

mydata$C[1:2]

[1] 1 1 2 3

[[2]] 1 1 2 3

(the first list item is double square brackets in R but not here for some reason)

And thus first(mydata$C[1:2])

1 1 2 3

Which looks right, I just need to have that sent to the target cell.

Also do I need to chain the across lines explicitly that way? I tried !is.numeric, and also is.POSIXct | is.character | is.logical | is.list.

Thanks. Relatedly if anyone knows how to create column lists which aren't ranges that would be good - I've managed to do this through grouped summaries before, but for this reprex I copied the R For Data Science example, but it doesn't seem generalisable, i.e. if you change the last line of mydata to

C = list(c(1,2,3), 1:3, 2:4, 2:4, 3:6, 3:6))

Then C1 will be "c(1,2,3)" rather than evaluating the comma separated terms, seemingly disallowing anything that's not a single value or range?

Cheers!


Solution

  • We group by 'A', 'B', get the first element of the list, and wrap it in a list in across

    library(dplyr)
    out <- mydata %>%
      group_by(A, B) %>% 
      summarise(across(where(is.numeric), mean, na.rm = TRUE), 
                across(where(is.list),  ~ list(first(.)))) 
    

    -output

    out
    # A tibble: 5 x 3
    # Groups:   A [3]
    #      A     B C        
    #  <dbl> <dbl> <list>   
    #1     1     1 <int [3]>
    #2     1    NA <int [3]>
    #3     2     4 <int [3]>
    #4     2     5 <int [3]>
    #5     3     7 <int [4]>
    

    If we want to get the first element, another option is slice

    mydata %>% 
         group_by(A, B) %>%
         slice(1)