Search code examples
rdataframemergereshapestring-concatenation

R - Concatenate cell in dataframe, by group, depending on another cell value


I have a dataset of the following type (first row is the header):

  • content is always text
  • merge is always a logical
id1  id2  start_line end_line content           merge
A    B    1          1        "aaaa"            TRUE
A    B    4          4        "aa mm"           TRUE
A    B    5          5        "boool"           TRUE
A    B    6          6        "omw"             TRUE
C    D    6          6        "hear!"           TRUE
C    D    7          7        " me out!"        TRUE
C    D    21         21       "hello"           FALSE

Problem: I need to merge following a very specific criteria:

  • Rows that have merge = FALSE must remain as is
  • Rows that have: same id1, same id2 and consecutive start_line:
    • Need to be appended on the column content
    • The end_line value needs to change to the last row

So, the expected result would be:

id1  id2  start_line end_line content             merge
A    B    1          1        "aaaa"              TRUE
A    B    4          6        "aa mm boool omw"   TRUE
C    D    6          7        "hear!  me out!"    TRUE
C    D    21         21       "hello"             FALSE

Notice in the example that:

  • The minimal merge is with two lines (example of ids: C-D, originally 6th and 7th rows)
  • There can be multiple lines to merge (example of ids A-B, originally rows 2nd, 3rd, 4th)

I have attempted a very large, and inefficient series of loops, that only merge two lines. That is why I am not posting my attempt here.


Solution

  • Using dplyr you can try :

    library(dplyr)
    
    df %>%
     group_by(id1, id2, grp = cumsum(c(TRUE, diff(start_line) > 1))) %>%
     summarise(start_line = first(start_line), 
               end_line = last(end_line), 
               content = paste(content, collapse = " "), 
                merge = any(merge))
    
    
    #  id1   id2     grp start_line end_line content         merge
    #  <chr> <chr> <int>      <int>    <int> <chr>           <lgl>
    #1 A     B         1          1        1 aaaa            TRUE 
    #2 A     B         2          4        6 aa mm boool omw TRUE 
    #3 C     D         2          6        7 hear!  me out!  TRUE 
    #4 C     D         3         21       21 hello           FALSE
    

    data

    df <- structure(list(id1 = c("A", "A", "A", "A", "C", "C", "C"), id2 = c("B", 
    "B", "B", "B", "D", "D", "D"), start_line = c(1L, 4L, 5L, 6L, 
    6L, 7L, 21L), end_line = c(1L, 4L, 5L, 6L, 6L, 7L, 21L), content = c("aaaa", 
    "aa mm", "boool", "omw", "hear!", " me out!", "hello"), merge = c(TRUE, 
    TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)), class = "data.frame", 
    row.names = c(NA, -7L))