Search code examples
rdplyrconcatenationna

Concatenate string field using dplyr when some values are NA


I have a dataframe of data containing a comments field. In some rows of the data there are single rows with no comment (NA in the comment field). There are some locations in the data that have more than one row that may or may not contain a comment.

The data resembles this structure (although with more fields):

input <- data.frame(
  stringsAsFactors = FALSE,
          Location = c(1L, 1L, 1L, 2L, 2L, 3L, 4L),
           Comment = c("This is a comment", NA, "This is another comment", "This is a comment", NA, "This is a comment", NA)
)
Location  Comment
1         This is a comment
1         NA
1         This is another comment
2         This is a comment
2         NA
3         This is a comment
4         NA

I can concatenate this using group and summarise like so:

output <- input %>%
  group_by(Location) %>%
  summarise(Comment = paste(Comment, collapse = " | "))

But this converts the NA values to strings.

Location  Comment
1         "This is a comment | NA | This is another comment"
2         "This is a comment | NA"
3         "This is a comment"
4         "NA"

But the output I really want from the process would exclude NAs from the final comment, unless the only comment for a location was NA

outputDesired <- data.frame(
  stringsAsFactors = FALSE,
          Location = c(1L, 2L, 3L, 4L),
          Comment = c("This is a comment | This is another comment", "This is a comment", "This is a comment", NA)
)
Location  Comment
1         This is a comment | This is another comment
2         This is a comment
3         This is a comment
4         NA

I can easily convert the "NA" text in location 4 to an actual NA value, and I'm thinking of removing an " | NA" if it exists, but could do with some help in sticking it into a case_when statement something like:

output <- input %>%
  group_by(Location) %>%
  summarise(Comment = paste(Comment, collapse = " | ")) %>%
  mutate(Comment = case_when(
    Comment == "NA" ~ NA,
    Comment ... (contains " | NA") ~ (remove pattern)
  ))

Ideally, though, it would be better if I could ignore NA comments in the first place, but keep all locations in the final output.

Note that in real life this is part of a larger dplyr pipe, so I'd prefer a tidyverse solution, but happy to explore other options.

Any ideas?


Solution

  • You can use na.omit to drop the NA values, na_if would change the empty values to NA.

    library(dplyr)
    
    input %>%
      group_by(Location) %>%
      summarise(Comment = na_if(paste0(na.omit(Comment), collapse = '|'), ''))
    
    #  Location Comment                                  
    #     <int> <chr>                                    
    #1        1 This is a comment|This is another comment
    #2        2 This is a comment                        
    #3        3 This is a comment                        
    #4        4 NA