Search code examples
rdplyrgroup-byaggregatedata-wrangling

How to summarise a dataframe retaining all the columns


Consider the following dataframe:

dummy_df <- tibble(
  A=c("ABC", "ABC", "BCD", "CDF", "CDF", "CDF"),
  B=c(0.25, 0.25, 1.23, 0.58, 0.58, 0.58),
  C=c("lorem", "ipsum", "dolor", "amet", "something", "else"),
  D=c("up", "up", "down", "down", "down", "down"),
  E=c(132, 132, 243, 512, 512, 512),
  F=c("m1", "m1", "m5", "m3", "m3", "m3"),
  G=c("val", "val", "mur", "mad", "mad", "mad"),
  H=c("grx", "grx", "bcn", "sal", "sal", "sal"),
  I=c(1.68, 1.68, 2.31, 3.12, 3.12, 3.12),
  J=c("p", "p", "f", "p", "p", "p"),
  K=c(100, 100, 200, 143, 143, 143)
)

# A tibble: 6 × 11
  A         B C         D         E F     G     H         I J         K
  <chr> <dbl> <chr>     <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC    0.25 lorem     up      132 m1    val   grx    1.68 p       100
2 ABC    0.25 ipsum     up      132 m1    val   grx    1.68 p       100
3 BCD    1.23 dolor     down    243 m5    mur   bcn    2.31 f       200
4 CDF    0.58 amet      down    512 m3    mad   sal    3.12 p       143
5 CDF    0.58 something down    512 m3    mad   sal    3.12 p       143
6 CDF    0.58 else      down    512 m3    mad   sal    3.12 p       143

After reading this, I have managed to collapse column C so that its value is concatenated into a string for every unique row value of column A.

dummy_df %>% group_by(A) %>% summarise(hits = toString(C), nhits=n())

# A tibble: 3 × 3
  A     hits                  nhits
  <chr> <chr>                 <int>
1 ABC   lorem, ipsum              2
2 BCD   dolor                     1
3 CDF   amet, something, else     3

However, I'm loosing every other column information, which are essential to me. How can I retain information about all the columns while collapsing column C? Ideally it should be done without having to hard-code the columns names, since the number of columns can vary depending on the dataset.

I have read this, but the example shown doesn't create new variables so I haven't been able to make it work.

This is what I'm looking for:

# A tibble: 3 × 12
  A     hits                  nhits     B D         E F     G     H         I J         K
  <chr> <chr>                 <int> <dbl> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 ABC   lorem, ipsum              2  0.25 up      132 m1    val   grx    1.68 p       100
2 BCD   dolor                     1  1.23 down    243 m5    mur   bcn    2.31 f       200
3 CDF   amet, something, else     3  0.58 down    512 m3    mad   sal    3.12 p       143

Solution

  • If we assume that all fields other than C will contain repeating values for each A-group, then we can do this:

    dummy_df %>%
      group_by(A) %>%
      summarize(
        nhits = n(),
        across(where(is.character), ~ toString(unique(.))),
        across(where(~ !is.character(.)), ~ unique(.))
      ) %>%
      ungroup()
    # # A tibble: 3 × 12
    #   A     nhits C                     D     F     G     H     J         B     E     I     K
    #   <chr> <int> <chr>                 <chr> <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
    # 1 ABC       2 lorem, ipsum          up    m1    val   grx   p      0.25   132  1.68   100
    # 2 BCD       1 dolor                 down  m5    mur   bcn   f      1.23   243  2.31   200
    # 3 CDF       3 amet, something, else down  m3    mad   sal   p      0.58   512  3.12   143
    

    In this case, since all other variables are repeating, everything stays as-is. If, however, we have variability, then I've set up a catch, where strings are combined (with toString) and differing numbers cause repeat rows.

    dummy_df$H[1] <- "GRX"
    dummy_df$K[1] <- 99
     %>%
      group_by(A) %>%
      summarize(
        nhits = n(),
        across(where(is.character), ~ toString(unique(.))),
        across(where(~ !is.character(.)), ~ unique(.))
      ) %>%
      ungroup()
    # A tibble: 4 × 12
      A     nhits C                     D     F     G     H        J         B     E     I     K
      <chr> <int> <chr>                 <chr> <chr> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
    1 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68    99
    2 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68   100
    3 BCD       1 dolor                 down  m5    mur   bcn      f      1.23   243  2.31   200
    4 CDF       3 amet, something, else down  m3    mad   sal      p      0.58   512  3.12   143
    

    The across function iterates over multiple columns. The where function allows us to subset columns by something about their values, where for character columns we will apply toString, and for others we'll go with unique. This means that string columns should never add rows, but non-strings may.

    An alternative for the number-like columns would be to use first(.) instead of unique(.), which will silently discard all but the first value in each column by group. Using the same modified data, we would see this, where the key difference is that K has discarded the 100 values since the first value is 99.

    dummy_df %>%
      group_by(A) %>%
      summarize(
        nhits = n(), 
        across(where(is.character), ~ toString(unique(.))),
        across(where(~ !is.character(.)), ~ first(.))
      ) %>%
      ungroup()
    # # A tibble: 3 × 12
    #   A     nhits C                     D     F     G     H        J         B     E     I     K
    #   <chr> <int> <chr>                 <chr> <chr> <chr> <chr>    <chr> <dbl> <dbl> <dbl> <dbl>
    # 1 ABC       2 lorem, ipsum          up    m1    val   GRX, grx p      0.25   132  1.68    99
    # 2 BCD       1 dolor                 down  m5    mur   bcn      f      1.23   243  2.31   200
    # 3 CDF       3 amet, something, else down  m3    mad   sal      p      0.58   512  3.12   143
    

    You can choose to use first with string columns as well, if you choose, which could simplify the logic to a single across:

    dummy_df %>%
      group_by(A) %>%
      summarize(
        nhits = n(),
        across(everything(), ~ first(.))
      ) %>%
      ungroup()
    # # A tibble: 3 × 12
    #   A     nhits     B C     D         E F     G     H         I J         K
    #   <chr> <int> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <dbl>
    # 1 ABC       2  0.25 lorem up      132 m1    val   GRX    1.68 p        99
    # 2 BCD       1  1.23 dolor down    243 m5    mur   bcn    2.31 f       200
    # 3 CDF       3  0.58 amet  down    512 m3    mad   sal    3.12 p       143
    

    Note that we've lost H's "grx" (since "GRX" was first) as well as K's value of 100.

    The data.table variant of this last code-block can be simply:

    library(data.table)
    as.data.table(dummy_df)[, c(.(nhits = .N), lapply(.SD, first)), by = A]
    

    and a base R rendition? Perhaps not as "pretty" :-)

    Reduce(
      function(a, b) merge(a, b, by = "A", all = TRUE), 
      list(
        setNames(aggregate(C ~ A, dummy_df, FUN = length), c("A", "nhits")), 
        aggregate(C ~ A, dummy_df, FUN = toString),
        aggregate(. ~ A, subset(dummy_df, select = -C), FUN = function(z) z[1])
      )
    )
    #     A nhits                     C    B    D   E  F   G   H    I J   K
    # 1 ABC     2          lorem, ipsum 0.25   up 132 m1 val GRX 1.68 p  99
    # 2 BCD     1                 dolor 1.23 down 243 m5 mur bcn 2.31 f 200
    # 3 CDF     3 amet, something, else 0.58 down 512 m3 mad sal 3.12 p 143