Search code examples
rdplyr

Summarise into strings based on condition from other column


I'm struggling with a seemingly simple task: in df I want to summarise data in words into strings (no problem) and also the respective Durations of these words (problem). My attempt at subsetting Duration on the condition that [!is.na(words)] fails; instead all values in Duration get stringed together:

NB: I do not want to filter the dataframe on those rows only where words are not NA because in my actual data frame I have many more columns that I need to summarise in a similar way.

library(tidyverse)
df %>%
  summarise(
    words = str_c(words[!is.na(words)], collapse = ","),
    words_dur = str_c(Duration[!is.na(words)], collapse = ",")
  )
            words                     words_dur
1 hey,how,are,you 44,150,30,55,77,80,99,100,200

The expected output is this:

            words      words_dur
1 hey,how,are,you    44,55,77,99

Data:

df <- data.frame(
  words = c("hey", NA, NA, "how", "are", NA, "you", NA, NA),
  Duration = c(44, 150, 30, 55, 77, 80, 99, 100, 200)
)

Solution

  • The problem here is that you have already re-defined words in the first calculation in summarise(). At this point words is "hey,how,are,you" so !is.na(words) is equal to TRUE which gets recycled and returns all values from Duration. Either change the name of the words variable in the first calculation, e.g. words2 = str_c(words[!is.na(words)], collapse = ",") or reverse the order of the calculations:

    df %>%
      summarise(
        words_dur = str_c(Duration[!is.na(words)], collapse = ","),
        words = str_c(words[!is.na(words)], collapse = ",")
      )
    
    #     words_dur           words
    # 1 44,55,77,99 hey,how,are,you