Search code examples
rdplyrgroup-byaggregate

Collapse data by ID in data frame in R


I am try to collapse data by ID and Gender. I have the data frame:

ID <- c(1,1,1,1,2,2,2,3,3,3,4,4,4)
Gender <- c("M","M","M","M","F","F",'F',"F","F","F", "M", "M", "M")
Test1 <- c("70", "NA", "NA", "NA", "NA", "85", "NA", "NA", "90", "NA", "NA", "NA", "90")
Test2 <- c("NA", "60", "NA", "NA", "NA", "NA", "82", "NA", "NA", "87", "NA", "88", "NA")

df <- data.frame(ID, Gender, Test1, Test2)

   ID Gender Test1 Test2
1   1      M    70    NA
2   1      M    NA    60
3   1      M    NA    NA
4   1      M    NA    NA
5   2      F    NA    NA
6   2      F    85    NA
7   2      F    NA    82
8   3      F    NA    NA
9   3      F    90    NA
10  3      F    NA    87
11  4      M    NA    NA
12  4      M    NA    88
13  4      M    90    NA

I am hoping to get help on how to collapse the data across ID and Gender so I can have 1 row per ID. Which would look like this:

  id gender test1 test2
1  1      M    70    60
2  2      F    85    82
3  3      F    90    87
4  4      M    90    88

Any help would be greatly appreciated! Thanks!


Solution

  • # convert the type from factor to integer
    # this step is not necessary if you create these columns of type integer
    df$Test1 <- as.integer(as.character(df$Test1))
    df$Test2 <- as.integer(as.character(df$Test2))
    
    # choose non-NA value for each (ID, gender) combination
    # the function max is interchangeable, you just need the NA treatment 
    df %>%
      group_by(ID, Gender) %>%
      summarise(
        Test1 = max(Test1, na.rm = T),
        Test2 = max(Test2, na.rm = T)
      )
    
    # # A tibble: 4 x 4
    # # Groups:   ID [?]
    #      ID Gender Test1 Test2
    #   <dbl> <fct>  <int> <int>
    # 1     1 M         70    60
    # 2     2 F         85    82
    # 3     3 F         90    87
    # 4     4 M         90    88
    

    With some type adjustments:

    # create the example data with suitable column types
    df <- data_frame(
      ID = c(rep(1, 4), rep(2:4, each = 3)),
      Gender = c(rep("M", 4), rep("F", 6), rep("M", 3)),
      Test1 = c(70, rep(NA, 4), 85, rep(NA, 2), 90, rep(NA, 3), 90),
      Test2 = c(NA, 60, rep(NA, 4), 82, rep(NA, 2), 87, NA, 88, NA)
    )
    
    df %>%
      group_by(ID, Gender) %>%
      summarise(
        Test1 = max(Test1, na.rm = T),
        Test2 = max(Test2, na.rm = T)
      )