Search code examples
rcoalesce

How to collapse many records into one while removing NA values


Say I have the following dataframe df

name <- c("Bill", "Rob", "Joe", "Joe")
address <- c("123 Main St", "234 Broad St", NA, "456 North Ave")
favteam <- c("Dodgers", "Mets", "Pirates", NA)

df <- data.frame(name = name, 
                 address = address,
                 favteam = favteam)
df

Which looks like:

  name       address favteam
1 Bill   123 Main St Dodgers
2  Rob  234 Broad St    Mets
3  Joe          <NA> Pirates
4  Joe 456 North Ave    <NA>

What I want to do is collapse (coalesce) rows by name (or in general, any number of grouping variables) and have any other value than NA replace the NA value in the final data, like so:

df_collapse <- foo(df)

  name   address        favteam
1 Bill   123 Main St    Dodgers
2  Rob   234 Broad St      Mets
3  Joe   456 North Ave  Pirates

Solution

  • Here's an option with dplyr:

    library(dplyr)
    
    df %>%
      group_by(name) %>%
      summarise_each(funs(first(.[!is.na(.)]))) # or summarise_each(funs(first(na.omit(.))))
    
    #Source: local data frame [3 x 3]
    #
    #  name       address favteam
    #1 Bill   123 Main St Dodgers
    #2  Joe 456 North Ave Pirates
    #3  Rob  234 Broad St    Mets
    

    And with data.table:

    library(data.table)
    setDT(df)[, lapply(.SD, function(x) x[!is.na(x)][1L]), by = name]
    #   name       address favteam
    #1: Bill   123 Main St Dodgers
    #2:  Rob  234 Broad St    Mets
    #3:  Joe 456 North Ave Pirates
    

    Or

    setDT(df)[, lapply(.SD, function(x) head(na.omit(x), 1L)), by = name]
    

    Edit:

    You say in your actual data you have varying numbers of non-NA responses per name. In that case, the following approach may be helpful.

    Consider this modified sample data (look at last row):

    name <- c("Bill", "Rob", "Joe", "Joe", "Joe")
    address <- c("123 Main St", "234 Broad St", NA, "456 North Ave", "123 Boulevard")
    favteam <- c("Dodgers", "Mets", "Pirates", NA, NA)
    
    df <- data.frame(name = name, 
                     address = address,
                     favteam = favteam)
    
    df
    #  name       address favteam
    #1 Bill   123 Main St Dodgers
    #2  Rob  234 Broad St    Mets
    #3  Joe          <NA> Pirates
    #4  Joe 456 North Ave    <NA>
    #5  Joe 123 Boulevard    <NA>
    

    Then, you can use this data.table approach to get the non-NA responses that can be varying in number by name:

    setDT(df)[, lapply(.SD, function(x) unique(na.omit(x))), by = name]
    #   name       address favteam
    #1: Bill   123 Main St Dodgers
    #2:  Rob  234 Broad St    Mets
    #3:  Joe 456 North Ave Pirates
    #4:  Joe 123 Boulevard Pirates