I am dealing with a dataset with duplicate rows like this
Id Date x1 a1 Col1 Col2 Col3
1 2004-11-29 1 2 0 NA 1
1 2004-11-29 1 2 1 0 0
2 2005-04-26 2 2 NA 1 0
3 2006-10-09 1 2 1 0 1
3 2006-10-09 1 2 0 0 NA
What I like to do is, among the two rows for the same ID , same date, if the value in Col1, Col2, Col3 is 1 then replace the value with 1, else its a 0 or NA if both values are missing in that column.
For example, ID 1 two rows with same date,
Two rows with
So on
Expecting a dataset like this
Id Date x1 a1 Col1 Col2 Col3
1 2004-11-29 1 2 1 0 1
2 2005-04-26 2 2 NA 1 0
3 2006-10-09 1 2 1 0 1
Thanks for any help regarding this in advance.
If you are open to a dplyr
, you could use
library(dplyr)
df %>%
group_by(Id, Date, x1, a1) %>%
summarise(across(Col1:Col3, ~na_if(max(coalesce(.x, -1)), -1)),
.groups = "drop")
This returns
# A tibble: 3 x 7
Id Date x1 a1 Col1 Col2 Col3
<dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2004-11-29 1 2 1 0 1
2 2 2005-04-26 2 2 NA 1 0
3 3 2006-10-09 1 2 1 0 1
The main idea here is to always select the max value per column and group. This is based on the assumption the values are either 0 or 1 or missing.