Search code examples
rdataframedplyraggregate

How to merge several rows with event data from multiple columns when based on unique character column?


Let's say I have d:

> d
  pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
1   A      0  2018-12-31      1  2011-02-01      0  2018-12-31      0  2018-12-31
2   A      1  2013-10-31      0  2018-12-31      0  2018-12-31      1  2013-10-31
3   A      0  2018-12-31      0  2018-12-31      1  2016-09-30      0  2018-12-31
4   B      0  2018-12-31      0  2018-12-31      1  2004-04-30      0  2018-12-31
5   C      0  2018-12-31      0  2018-12-31      0  2018-12-31      1  2006-01-17
6   C      1  2008-07-31      0  2018-12-31      1  2009-01-31      0  2018-12-31

d$pnr is a unique patient identifier. age.hl, kon.hl, sen.hl and mix.hl denote different diseases and their corresponding time variable. All patients are censored at 2018-12-31 if 0 event occured.

If 1 occur in any column, it cannot occur again in the same column.

I need to merge rows, so d$pnr only occur one time. I.e, event data/information from each column shall be merged to the same row.

I look for a solution in dplyr.

Expected output:

  pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
1   A      1  2013-10-31      1  2011-02-01      1  2016-09-30      1  2013-10-31
2   B      0  2018-12-31      0  2018-12-31      1  2004-04-30      0  2018-12-31
3   C      1  2008-07-31      0  2018-12-31      1  2009-01-31      1  2006-01-17

Data

d <- data.frame(
  pnr = c("A", "A", "A", "B", "C", "C"),
  age.hl = c(0, 1, 0, 0, 0, 1),
  age.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
                  as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2008-07-31")),
  kon.hl = c(1, 0, 0, 0, 0, 0),
  kon.hl.time = c(as.Date("2011-02-01"), as.Date("2018-12-31"), as.Date("2018-12-31"),
                  as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2018-12-31")),
  sen.hl = c(0, 0, 1, 1, 0, 1),
  sen.hl.time = c(as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2016-09-30"),
                  as.Date("2004-04-30"), as.Date("2018-12-31"), as.Date("2009-01-31")),
  mix.hl = c(0, 1, 0, 0, 1, 0),
  mix.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
                  as.Date("2018-12-31"), as.Date("2006-01-17"), as.Date("2018-12-31"))
)

Solution

  • You can use which.max():

    library(dplyr)
    
    d %>%
      summarise(across(ends_with("time"), ~ .x[which.max(get(sub(".time", "", cur_column())))]),
                across(ends_with("hl"), max),
                .by = pnr) %>%
      select(names(d))
    
    #   pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
    # 1   A      1  2013-10-31      1  2011-02-01      1  2016-09-30      1  2013-10-31
    # 2   B      0  2018-12-31      0  2018-12-31      1  2004-04-30      0  2018-12-31
    # 3   C      1  2008-07-31      0  2018-12-31      1  2009-01-31      1  2006-01-17
    

    Note: The function in the first across() involves the values in the second across(), so the two across() are not exchangeable.