I have a dataframe where I wanna delete data that appear two times consecutively in two columns, (its ok if it appear two times by row)
basically I wanna go from that:
df1<-data.frame(col1=c("a","a"),col2=c("b","b"),col3=c("b","b"),col4=c("e","e"),col5=c("b","b"))
col1 col2 col3 col4 col5
x1 a b b e b
x2 a b b e b
to that:
df2<-data.frame(col1=c("a","a"),col2=c("b","b"),col3=c(NA,NA),col4=c("e","e"),col5=c("b","b"))
col1 col2 col3 col4 col5
x1 a b NA e b
x2 a b NA e b
or eventually that:
df3<-data.frame(col1=c("a","a"),col2=c("b","b"),col3=c("e","e"),col4=c("b","b"),col5=c(NA,NA))
col1 col2 col3 col4 col5
x1 a b e b NA
x2 a b e b NA
you can see only data that appear two times consecutively by column disappear and not by row
here what it would look like if i expend it:
col1 col2 col3 col4 col5 col6
x1 a b b e b d
x2 a b b e b d
x3 a b c d e e
how I want it to change:
col1 col2 col3 col4 col5 col6
x1 a b NA e b d
x2 a b NA e b d
x3 a b c d e NA
df1 https://i.sstatic.net/ufUjq.png
On a slightly modified data set where I added a third repeated field (see if the result is right in this case), using t
ranspose and lag
to check the previous values, then transposing back also adjusting the row names.
df1 <- structure(list(col1 = c("a", "a", "a"), col2 = c("b", "b", "b"
), col3 = c("b", "b", "c"), col4 = c("b", "e", "e"), col5 = c("c",
"b", "e")), row.names = c(NA, -3L), class = "data.frame")
df1
col1 col2 col3 col4 col5
1 a b b b c
2 a b b e b
3 a b c e e
library(dplyr)
data.frame(t(df1)) %>%
mutate(across(everything(), ~ if_else(.x == lag(.x, default=""), NA, .x))) %>%
t() %>%
data.frame(row.names = NULL)
col1 col2 col3 col4 col5
1 a b <NA> <NA> c
2 a b <NA> e b
3 a b c e <NA>