Search code examples
rdplyrstringrdifftime

Group and rename based on two conditions in R (dplyr)


I have a dataset, df:

Ultimately I would like to be able to group the data into 'chunks' where the Folder column contains the string 'Out', making sure to consider the DATE and the empty Message values that it is associated with. Is there a way to create a chunk for each instance that 'Out' and an empty Message row occurs, while computing its duration.

  Folder               DATE                         Message
  Outdata              9/9/2019 5:46:00                   
  Outdata              9/9/2019 5:46:01
  Outdata              9/9/2019 5:46:02
  In                   9/9/2019 5:46:03            hello
  In                   9/9/2019 5:46:04            hello
  Outdata              9/10/2019 6:00:01
  Outdata              9/10/2019 6:00:02
  In                   9/11/2019 7:50:00           hello
  In                   9/11/2019 7:50:01           hello

I would like this output:

 New Variable        Duration        Message
 Outdata1              2 sec
 Outdata2              1 sec

I have included the dput:

dput(sample)
structure(list(Folder = structure(c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 
1L, 1L), .Label = c("In", "Outdata"), class = "factor"), Date = structure(c(5L, 
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L), .Label = c("9/10/2019 6:00:01 AM", 
"9/10/2019 6:00:02 AM", "9/11/2019 7:50:00 AM", "9/11/2019 7:50:01 AM", 
"9/9/2019 5:46:00 AM", "9/9/2019 5:46:01 AM", "9/9/2019 5:46:02 AM", 
"9/9/2019 5:46:03 AM", "9/9/2019 5:46:04 AM"), class = "factor"), 
Message = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L), .Label = c("", 
"hello"), class = "factor")), class = "data.frame", row.names = c(NA, 
-9L))

This is what I have tried, which works well, I just need to account for the condition of the empty Message values as well.

  library(dplyr)

  df  %>%
  mutate(DATE = as.POSIXct(DATE, format = "%m/%d/%Y %I:%M:%S %p"), 
     gr = cumsum(Folder != lag(Folder, default = TRUE))) %>%
 filter(Folder == "Out") %>%
 arrange(gr, DATE) %>%
 group_by(gr) %>%
 summarise(Duration = difftime(last(DATE), first(DATE), units = "secs")) %>%
 mutate(gr = paste0('Out', row_number()))

The above code works fine, but I am not sure how to satisfy the condition where row == ""


Solution

  • Probably, just paste the Message together in one string.

    library(dplyr)
    
    sample  %>%
      mutate(DATE = as.POSIXct(Date, format = "%m/%d/%Y %I:%M:%S %p"), 
             gr = cumsum(Folder != lag(Folder, default = TRUE))) %>%
      filter(Folder == "Outdata") %>%
      arrange(gr, DATE) %>%
      group_by(gr) %>%
      summarise(Duration = difftime(last(DATE), first(DATE), units = "secs"), 
                Message = paste0(Message, collapse = "")) %>%
      mutate(gr = paste0('Out', row_number()))