Search code examples
rgroup-byrows

Merge rows by key and year with multiple columns in R


I have the following dataset:

df1 <- data.frame(
  "key" = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3), 
  "year" = c(2002, 2002, 2004, 2004, 2002, 2002, 2004, 2004, 2004, 2004),
  "Var1" = c(10, NA, 5, 5, 4, NA, NA, 3, 2, 2),
  "Var2" = c(1, 1, 3, 3, 2, NA, 3, NA, 1, NA),
  "Var3" = c(NA, 2, NA, NA, 5, 5, 3, NA, 2, NA),
  "Var4" = c(NA, 4, 5, 5, 6, NA, 4, NA, NA, NA))

I now want to merge the duplicate rows by key and year to have a dataset that looks like follows:

df2 <- data.frame(
  "key" = c(1, 1, 2, 2, 3), 
  "year" = c(2002, 2004, 2002, 2004, 2004),
  "Var1" = c(10, 5, 4, 3, 2),
  "Var2" = c(1, 3, 2, 3, 1),
  "Var3" = c(2, NA, 5, 3, 2),
  "Var4" = c(4, 5, 6, 4, NA))

The problem is that I have over 30 columns and hundreds to thousands of rows. Thus, this solution seems a little bit unhandy: Merge rows within a dataframe by a key. I would appreciate any help!


Solution

  • You can group_by(key, year) and get the maximum value for each column, excluding NAs and groups with only NAs:

    library(dplyr)
    df1 %>% 
      group_by(key, year) %>% 
      summarise(across(everything(), ~ ifelse(all(is.na(.x)), NA, max(.x, na.rm = T))))
    
    ## A tibble: 5 x 6
    ## Groups:   key [3]
    #    key  year  Var1  Var2  Var3  Var4
    #  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #1     1  2002    10     1     2     4
    #2     1  2004     5     3    NA     5
    #3     2  2002     4     2     5     6
    #4     2  2004     3     3     3     4
    #5     3  2004     2     1     2    NA