I have two factor columns, and I want to create a third column that tells me what the second one has that the first does not.
It's very similar to this post, but I'm having trouble going from a df to using the setdiff() function.
For example:
library(dplyr)
y1 <- c("a.b.","a.","b.c.d.")
y2 <- c("a.b.c.","a.b.","b.c.d.")
df <- data.frame(y1,y2)
Column y1 has a.b. and column y2 has a.b.c. I want a third column to return c. or just c.
> df
y1 y2 col3
1 a.b. a.b.c. c.
2 a. a.b. b.
3 b.c.d. b.c.d.
I think it should be a combination of strsplit and setdiff, but I can't get it to work.
I've tried converting the factor to character and then applying strsplit() to the result, but the output seems a bit weird to me. It seems to have created a list within a list, which makes it difficult to pass to setdiff().
#convert factor to character
df <- df %>% mutate_if(is.factor, as.character)
lapply(df$y1,function(x)(strsplit(x,split = "[.]")))
> lapply(df$y1,function(x)(strsplit(x,split = "[.]")))
[[1]]
[[1]][[1]]
[1] "a" "b"
[[2]]
[[2]][[1]]
[1] "a"
[[3]]
[[3]][[1]]
[1] "b" "c" "d"
Update
There was an issue when the difference had more than one element: it created an additional row. To overcome that, we paste all the elements of each difference together. This also saves us the unlist step.
df$col3 <- mapply(function(x, y) paste0(setdiff(y, x), collapse = ""),
strsplit(as.character(df$y1), "\\."), strsplit(as.character(df$y2), "\\."))
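To see the effect of collapsing, here is a minimal sketch on a made-up data frame. The name df2 and its values are assumptions for illustration, not part of the question; its second row differs by two elements.

```r
# Hypothetical data: row 2 differs by two elements ("b" and "c")
df2 <- data.frame(y1 = c("a.b.", "a."), y2 = c("a.b.c.", "a.b.c."),
                  stringsAsFactors = TRUE)

# Split each column on literal dots; one character vector per row
s1 <- strsplit(as.character(df2$y1), "\\.")
s2 <- strsplit(as.character(df2$y2), "\\.")

# Without collapsing, the results have different lengths,
# so mapply() cannot simplify them and returns a list
raw <- mapply(function(x, y) setdiff(y, x), s1, s2)

# Collapsing each difference yields exactly one string per row
df2$col3 <- mapply(function(x, y) paste0(setdiff(y, x), collapse = ""), s1, s2)
df2
```

With collapse = "" the elements are joined with no separator. Using collapse = "." instead would keep them dot-separated, if that is closer to the desired output.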
Original Answer
We can split both columns on "." using strsplit, then use mapply to take the difference between each pair of results with setdiff.
df$col3 <- mapply(function(x, y) setdiff(y, x),
strsplit(as.character(df$y1), "\\."), strsplit(as.character(df$y2), "\\."))
df
# y1 y2 col3
#1 a.b. a.b.c. c
#2 a. a.b. b
#3 b.c.d. b.c.d.
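On the nested list seen in the question: strsplit() is already vectorised over its input and returns a list with one element per string, so wrapping it in lapply() adds a second level of nesting. A minimal sketch of the difference (the names flat and nested are only for illustration):

```r
y1 <- c("a.b.", "a.", "b.c.d.")

# Calling strsplit() once on the whole vector gives a flat list
# with one character vector per input string
flat <- strsplit(y1, "[.]")

# Calling it per element inside lapply() wraps each result in another list
nested <- lapply(y1, function(x) strsplit(x, split = "[.]"))

str(flat)
str(nested)
```

This is why the code above passes the strsplit() results straight to mapply(), with no lapply() around them.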
If we don't want col3 as a list, we can unlist it. However, one issue is that unlist drops the character(0) value. To retain that value, we need to perform an additional check on it. Taken from here.
unlist(lapply(df$col3,function(x) if(identical(x,character(0))) ' ' else x))
#[1] "c" "b" " "
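The same check can also be sketched with lengths(), which avoids comparing against character(0) explicitly. This is an alternative, not the approach from the linked answer. The list diffs below is a stand-in for df$col3, and the sketch assumes each difference has at most one element.

```r
# Stand-in for df$col3 from the example above (an assumption for illustration)
diffs <- list("c", "b", character(0))

# Replace the empty elements with a placeholder, then flatten
diffs[lengths(diffs) == 0] <- " "
out <- unlist(diffs)
out
```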