I'm struggling with a seemingly simple task: in df
I want to summarise data in words
into strings (no problem) and also the respective Duration
s of these words
(problem). My attempt at subsetting Duration
on the condition that [!is.na(words)]
fails; instead all values in Duration
get stringed together:
NB: I do not want to filter the dataframe on those rows only where words
are not NA because in my actual data frame I have many more columns that I need to summarise in a similar way.
library(tidyverse)
df %>%
summarise(
words = str_c(words[!is.na(words)], collapse = ","),
words_dur = str_c(Duration[!is.na(words)], collapse = ",")
)
words words_dur
1 hey,how,are,you 44,150,30,55,77,80,99,100,200
The expected output is this:
words words_dur
1 hey,how,are,you 44,55,77,99
Data:
df <- data.frame(
words = c("hey", NA, NA, "how", "are", NA, "you", NA, NA),
Duration = c(44, 150, 30, 55, 77, 80, 99, 100, 200)
)
The problem here is that you have already re-defined words
in the first calculation in summarise()
. At this point words
is "hey,how,are,you"
so !is.na(words)
is equal to TRUE
which gets recycled and returns all values from Duration
. Either change the name of the words
variable in the first calculation, e.g. words2 = str_c(words[!is.na(words)], collapse = ",")
or reverse the order of the calculations:
df %>%
summarise(
words_dur = str_c(Duration[!is.na(words)], collapse = ","),
words = str_c(words[!is.na(words)], collapse = ",")
)
# words_dur words
# 1 44,55,77,99 hey,how,are,you