Search code examples
rdataframeduplicatesdata-manipulation

Deleting consecutively duplicated data


I have a dataframe where I wanna delete data that appear two times consecutively in two columns, (its ok if it appear two times by row)

basically I wanna go from that:

df1<-data.frame(col1=c("a","a"),col2=c("b","b"),col3=c("b","b"),col4=c("e","e"),col5=c("b","b"))

  col1 col2 col3 col4 col5
x1  a    b   b    e    b
x2  a    b   b    e    b

to that:

df2<-data.frame(col1=c("a","a"),col2=c("b","b"),col3=c(NA,NA),col4=c("e","e"),col5=c("b","b"))

  col1 col2 col3 col4 col5
x1  a    b   NA    e    b
x2  a    b   NA    e    b

or eventually that:

df3<-data.frame(col1=c("a","a"),col2=c("b","b"),col3=c("e","e"),col4=c("b","b"),col5=c(NA,NA))

  col1 col2 col3 col4 col5
x1  a   b    e    b   NA
x2  a   b    e    b   NA

you can see only data that appear two times consecutively by column disappear and not by row

here what it would look like if i expend it:

  col1 col2 col3 col4 col5 col6
x1  a   b    b    e    b    d
x2  a   b    b    e    b    d
x3  a   b    c    d    e    e

how I want it to change:

  col1 col2 col3 col4 col5 col6
x1  a   b    NA    e    b    d
x2  a   b    NA    e    b    d
x3  a   b    c     d    e   NA

df1 https://i.sstatic.net/ufUjq.png

df2 https://i.sstatic.net/gluIl.png

df3 https://i.sstatic.net/3ZH5j.png


Solution

  • On a slightly modified data set where I added a third repeated field (see if the result is right in this case), using transpose and lag to check the previous values, then transposing back also adjusting the row names.

    mod. data

    df1 <- structure(list(col1 = c("a", "a", "a"), col2 = c("b", "b", "b"
    ), col3 = c("b", "b", "c"), col4 = c("b", "e", "e"), col5 = c("c", 
    "b", "e")), row.names = c(NA, -3L), class = "data.frame")
    
    df1
      col1 col2 col3 col4 col5
    1    a    b    b    b    c
    2    a    b    b    e    b
    3    a    b    c    e    e
    
    library(dplyr)
    
    data.frame(t(df1)) %>% 
      mutate(across(everything(), ~ if_else(.x == lag(.x, default=""), NA, .x))) %>% 
      t() %>% 
      data.frame(row.names = NULL)
      col1 col2 col3 col4 col5
    1    a    b <NA> <NA>    c
    2    a    b <NA>    e    b
    3    a    b    c    e <NA>