Let's say I have d
:
> d
pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
1 A 0 2018-12-31 1 2011-02-01 0 2018-12-31 0 2018-12-31
2 A 1 2013-10-31 0 2018-12-31 0 2018-12-31 1 2013-10-31
3 A 0 2018-12-31 0 2018-12-31 1 2016-09-30 0 2018-12-31
4 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
5 C 0 2018-12-31 0 2018-12-31 0 2018-12-31 1 2006-01-17
6 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 0 2018-12-31
d$pnr
is a unique patient identifier. age.hl
, kon.hl
, sen.hl
and mix.hl
denote different diseases and their corresponding time variable. All patients are censored at 2018-12-31
if 0 event occured.
If 1 occur in any column, it cannot occur again in the same column.
I need to merge rows, so d$pnr
only occur one time. I.e, event data/information from each column shall be merged to the same row.
I look for a solution in dplyr
.
Expected output:
pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17
Data
d <- data.frame(
pnr = c("A", "A", "A", "B", "C", "C"),
age.hl = c(0, 1, 0, 0, 0, 1),
age.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2008-07-31")),
kon.hl = c(1, 0, 0, 0, 0, 0),
kon.hl.time = c(as.Date("2011-02-01"), as.Date("2018-12-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2018-12-31")),
sen.hl = c(0, 0, 1, 1, 0, 1),
sen.hl.time = c(as.Date("2018-12-31"), as.Date("2018-12-31"), as.Date("2016-09-30"),
as.Date("2004-04-30"), as.Date("2018-12-31"), as.Date("2009-01-31")),
mix.hl = c(0, 1, 0, 0, 1, 0),
mix.hl.time = c(as.Date("2018-12-31"), as.Date("2013-10-31"), as.Date("2018-12-31"),
as.Date("2018-12-31"), as.Date("2006-01-17"), as.Date("2018-12-31"))
)
You can use which.max()
:
library(dplyr)
d %>%
summarise(across(ends_with("time"), ~ .x[which.max(get(sub(".time", "", cur_column())))]),
across(ends_with("hl"), max),
.by = pnr) %>%
select(names(d))
# pnr age.hl age.hl.time kon.hl kon.hl.time sen.hl sen.hl.time mix.hl mix.hl.time
# 1 A 1 2013-10-31 1 2011-02-01 1 2016-09-30 1 2013-10-31
# 2 B 0 2018-12-31 0 2018-12-31 1 2004-04-30 0 2018-12-31
# 3 C 1 2008-07-31 0 2018-12-31 1 2009-01-31 1 2006-01-17
Note: The function in the first across()
involves the values in the second across()
, so the two across()
are not exchangeable.