r, dplyr, plyr, string-matching

Concatenating rows by group and dropping successive repeated elements


I have a data frame like the one below. For each ticket number, I would like to concatenate the names in row order, dropping successive repetitions, to see how the ticket was handed from person to person.

    ticket <- c("1", "1", "1", "2", "2", "2", "2")
    name <- c("Olg", "Jan", "Jan", "Olg", "Jan", "Jan", "Olg")
    df <- data.frame(ticket, name)

I want to create a column called sequence that gives the path for each ticket with successive repetitions suppressed, as shown below (Olg-Jan-Jan becomes Olg-Jan, and Olg-Jan-Jan-Olg becomes Olg-Jan-Olg). Any suggestions? Thanks!

    seq <- c("Olg-Jan", "Olg-Jan", "Olg-Jan", "Olg-Jan-Olg", "Olg-Jan-Olg", "Olg-Jan-Olg", "Olg-Jan-Olg")

Solution

  • We compare the numeric factor codes underlying name to detect consecutive duplicates and remove them. This needs name to be a factor, which is the data.frame() default before R 4.0.0; a character column can be converted with factor() first. We use dplyr so that we can easily group by ticket and chain the steps with the pipe operator (%>%).

    library(dplyr)
    
    df %>% group_by(ticket) %>%
      # factor() makes this work whether name is a character or a factor column
      filter(c(1, diff(as.numeric(factor(name)))) != 0) %>%
      summarise(sequence = paste(name, collapse = "-"))
    
      ticket    sequence
    1      1     Olg-Jan
    2      2 Olg-Jan-Olg
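
To see why the filter condition drops only successive repeats, here is a minimal base R sketch that walks through it for the names of ticket 2. The variable names (codes, keep, collapsed, via_rle) are illustrative and not part of the original answer, and the rle() cross-check is an alternative technique added for comparison.

```r
# Names for ticket 2 from the question, in row order.
name <- c("Olg", "Jan", "Jan", "Olg")

# Factor levels sort alphabetically, so Jan has code 1 and Olg has code 2.
codes <- as.numeric(factor(name))

# diff() is zero wherever a value repeats its predecessor. The leading 1
# stands in for the first element, which has no predecessor and is kept.
keep <- c(1, diff(codes)) != 0

# Keep only the rows that start a new run, then join them.
collapsed <- paste(name[keep], collapse = "-")

# Cross-check: rle() returns one value per run of identical
# consecutive elements.
via_rle <- paste(rle(name)$values, collapse = "-")

print(collapsed)
```

Both approaches keep a name that reappears later (the second Olg), because only adjacent duplicates are compared.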
    

    If you want to keep all the rows of the original data frame and just add the sequence, then you can left_join the output above to your original data frame:

    df <- df %>%
      left_join(df %>% group_by(ticket) %>%
                  filter(c(1, diff(as.numeric(factor(name)))) != 0) %>%
                  summarise(sequence = paste(name, collapse = "-")),
                by = "ticket")
    
      ticket name    sequence
    1      1  Olg     Olg-Jan
    2      1  Jan     Olg-Jan
    3      1  Jan     Olg-Jan
    4      2  Olg Olg-Jan-Olg
    5      2  Jan Olg-Jan-Olg
    6      2  Jan Olg-Jan-Olg
    7      2  Olg Olg-Jan-Olg
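
If the goal is only to add the column while keeping every row, the join can probably be avoided. The following is a sketch of one alternative, not the approach used above: it assumes dplyr is installed and uses base R's rle() inside a grouped mutate(). The name result is illustrative.

```r
library(dplyr)

# Same data as in the question.
df <- data.frame(
  ticket = c("1", "1", "1", "2", "2", "2", "2"),
  name   = c("Olg", "Jan", "Jan", "Olg", "Jan", "Jan", "Olg")
)

result <- df %>%
  group_by(ticket) %>%
  # rle() needs an atomic vector, so as.character() guards against a
  # factor column. paste() returns one string per group, which mutate()
  # recycles to every row of that group.
  mutate(sequence = paste(rle(as.character(name))$values, collapse = "-")) %>%
  ungroup()

print(result)
```

Because mutate() never drops rows, no filter and no join key are needed here; the trade-off is that the run-length encoding is recomputed inside each group rather than reusing the summarised table.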