Search code examples
rdplyraggregate

r column replace by group


I am dealing with a dataset with duplicate rows like this

Id   Date        x1    a1     Col1    Col2     Col3
1    2004-11-29  1     2      0       NA       1
1    2004-11-29  1     2      1       0        0
2    2005-04-26  2     2      NA      1        0
3    2006-10-09  1     2      1       0        1
3    2006-10-09  1     2      0       0        NA

What I like to do is, among the two rows for the same ID , same date, if the value in Col1, Col2, Col3 is 1 then replace the value with 1, else its a 0 or NA if both values are missing in that column.

For example, ID 1 two rows with same date,

  • Col1 0, 1 - replace the value to 1,
  • Col2 NA,0 - replace the value to 0
  • Col3 1, 0 - replace the value to 1

Two rows with

  • 0,1 gets replaced with 1,
  • NA, NA is replaced with NA
  • NA, 0 is replaced with 0
  • NA, 1 is replaced with 1
  • 1, 1 is replaced with 1
  • 0, 0 is replaced with 0

So on

Expecting a dataset like this

Id   Date        x1    a1     Col1    Col2     Col3
1    2004-11-29  1     2      1       0       1

2    2005-04-26  2     2      NA      1        0

3    2006-10-09  1     2      1       0        1

Thanks for any help regarding this in advance.


Solution

  • If you are open to a dplyr, you could use

    library(dplyr)
    df %>% 
      group_by(Id, Date, x1, a1) %>% 
      summarise(across(Col1:Col3, ~na_if(max(coalesce(.x, -1)), -1)),
                .groups = "drop")
    

    This returns

    # A tibble: 3 x 7
         Id Date          x1    a1  Col1  Col2  Col3
      <dbl> <date>     <dbl> <dbl> <dbl> <dbl> <dbl>
    1     1 2004-11-29     1     2     1     0     1
    2     2 2005-04-26     2     2    NA     1     0
    3     3 2006-10-09     1     2     1     0     1
    

    The main idea here is to always select the max value per column and group. This is based on the assumption the values are either 0 or 1 or missing.