
Compare two list-type columns row by row with dplyr


I have the following problem:

I want to create a new column in a data frame, based on the difference between two columns, where each row is a vector of strings:

My code:

library(dplyr) # v.1.0.7

seqs <- c("seq1","seq2","seq3","seq4","seq5")
expect_mut <- c("S:T20N,S:D614G","S:T20N,S:D614G","S:T20N,N:G204R,N:G80R", "N:G204R, S:D614G", "N:G204R, S:D614G")
observed_mut <- c("S:T20N","S:D164G","S:T20N, N:G204R","S:D614G,N:G204R","S:D164G,S:T19I")

data_frame <- data.frame(seqs, expect_mut, observed_mut)
data_frame <- data_frame %>% 
  mutate(expect_mut = strsplit(as.character(expect_mut), ","),
         observed_mut = strsplit(as.character(observed_mut), ",")) %>%
  group_by(seqs) %>%
  mutate(diff_mut = setdiff(observed_mut, expect_mut))

What I expect:

| seqs | expect_mut                       | observed_mut            | diff_mut               |
| ---- | -------------------------------- | ----------------------- | ---------------------- |
| seq1 | c("S:T20N", "S:D614G")           | S:T20N                  |                        |
| seq2 | c("S:T20N", "S:D614G")           | S:D164G                 | S:D164G                |
| seq3 | c("S:T20N", "N:G204R", "N:G80R") | c("S:T20N", " N:G204R") |                        |
| seq4 | c("N:G204R", "S:D614G")          | c("N:G204R", "S:D614G") |                        |
| seq5 | c("N:G204R", "S:D614G")          | c("S:D164G", "S:T19I")  | c("S:D164G", "S:T19I") |

What returns:

| seqs | expect_mut                       | observed_mut            | diff_mut                |
| ---- | -------------------------------- | ----------------------- | ----------------------- |
| seq1 | c("S:T20N", "S:D614G")           | S:T20N                  | S:T20N                  |
| seq2 | c("S:T20N", "S:D614G")           | S:D164G                 | S:D164G                 |
| seq3 | c("S:T20N", "N:G204R", "N:G80R") | c("S:T20N", " N:G204R") | c("S:T20N", " N:G204R") |
| seq4 | c("N:G204R", "S:D614G")          | c("N:G204R", "S:D614G") | c("N:G204R", "S:D614G") |
| seq5 | c("N:G204R", "S:D614G")          | c("S:D164G", "S:T19I")  | c("S:D164G", "S:T19I")  |

Basically, it is returning the same value of observed_mut in the diff_mut column...


Solution

  • Since both columns are list columns after strsplit, use map2 to loop over the corresponding list elements:

    library(dplyr)
    library(purrr)
    data_frame %>% 
      mutate(expect_mut = strsplit(as.character(expect_mut), ","),
             observed_mut = strsplit(as.character(observed_mut), ",")) %>% 
      mutate(diff_mut = map2(observed_mut, expect_mut, setdiff)) %>%
      as_tibble
    

    -output

    # A tibble: 5 × 4
      seqs  expect_mut observed_mut diff_mut 
      <chr> <list>     <list>       <list>   
    1 seq1  <chr [2]>  <chr [1]>    <chr [0]>
    2 seq2  <chr [2]>  <chr [1]>    <chr [1]>
    3 seq3  <chr [3]>  <chr [2]>    <chr [1]>
    4 seq4  <chr [2]>  <chr [2]>    <chr [1]>
    5 seq5  <chr [2]>  <chr [2]>    <chr [2]>
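
    For context on why the original setdiff call handed back observed_mut unchanged: inside a one-row group each list column is a list of length 1, and setdiff on two lists compares whole list elements rather than the strings inside them. A minimal base R sketch of that behaviour (the `observed` and `expected` objects are illustrative stand-ins for a single row, not part of the original code):

    ```r
    # What a single-row group sees: each list column is a list of length 1.
    observed <- list(c("S:T20N"))
    expected <- list(c("S:T20N", "S:D614G"))

    # setdiff() on lists compares whole elements. The one vector in
    # `observed` is not an element of `expected`, so it is returned as is.
    whole_elements <- setdiff(observed, expected)

    # Comparing the character vectors themselves gives the intended result,
    # which is what map2() does for every row.
    inner_vectors <- setdiff(observed[[1]], expected[[1]])

    print(whole_elements)
    print(inner_vectors)
    ```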
    

    Or, if we use the group_by approach (assuming all elements in 'seqs' are distinct), extract the first list element with [[:

    data_frame %>% 
       mutate(expect_mut = strsplit(as.character(expect_mut), ","),
             observed_mut = strsplit(as.character(observed_mut), ",")) %>%
       group_by(seqs) %>% 
       mutate(diff_mut = list(setdiff(observed_mut[[1]], expect_mut[[1]]))) %>%
       ungroup
    

    -output

    # A tibble: 5 × 4
      seqs  expect_mut observed_mut diff_mut 
      <chr> <list>     <list>       <list>   
    1 seq1  <chr [2]>  <chr [1]>    <chr [0]>
    2 seq2  <chr [2]>  <chr [1]>    <chr [1]>
    3 seq3  <chr [3]>  <chr [2]>    <chr [1]>
    4 seq4  <chr [2]>  <chr [2]>    <chr [1]>
    5 seq5  <chr [2]>  <chr [2]>    <chr [2]>
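
    Both outputs show one leftover element for seq3 and seq4, whereas the question expected none. That appears to come from the spaces after some commas in the input strings: splitting on "," leaves " N:G204R" and " S:D614G" with a leading space, and those do not match their unspaced counterparts. A minimal sketch, assuming the spaces carry no meaning, is to split on a comma followed by optional whitespace (the `result` name is only for illustration):

    ```r
    library(dplyr)
    library(purrr)

    seqs <- c("seq1", "seq2", "seq3", "seq4", "seq5")
    expect_mut <- c("S:T20N,S:D614G", "S:T20N,S:D614G", "S:T20N,N:G204R,N:G80R",
                    "N:G204R, S:D614G", "N:G204R, S:D614G")
    observed_mut <- c("S:T20N", "S:D164G", "S:T20N, N:G204R",
                      "S:D614G,N:G204R", "S:D164G,S:T19I")
    data_frame <- data.frame(seqs, expect_mut, observed_mut)

    result <- data_frame %>%
      # ",\\s*" consumes the comma and any whitespace that follows it
      mutate(expect_mut = strsplit(as.character(expect_mut), ",\\s*"),
             observed_mut = strsplit(as.character(observed_mut), ",\\s*"),
             diff_mut = map2(observed_mut, expect_mut, setdiff)) %>%
      as_tibble()

    # Number of unexpected mutations per sequence
    lengths(result$diff_mut)
    ```

    With this split, diff_mut is empty for seq1, seq3 and seq4, which lines up with the expected table in the question.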
    

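    NOTE: rowwise is safer than group_by here. If 'seqs' contains duplicates, group_by(seqs) with [[1]] computes the difference from the first row of each group only and recycles it to the other rows in that group.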
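
    A rowwise version of the same idea could look like the sketch below. It relies on dplyr's rowwise() behaviour of exposing each list-column cell as its underlying vector, so no [[1]] is needed; the result still has to be wrapped in list() to form a list column. The `result` name is only for illustration.

    ```r
    library(dplyr)

    seqs <- c("seq1", "seq2", "seq3", "seq4", "seq5")
    expect_mut <- c("S:T20N,S:D614G", "S:T20N,S:D614G", "S:T20N,N:G204R,N:G80R",
                    "N:G204R, S:D614G", "N:G204R, S:D614G")
    observed_mut <- c("S:T20N", "S:D164G", "S:T20N, N:G204R",
                      "S:D614G,N:G204R", "S:D164G,S:T19I")
    data_frame <- data.frame(seqs, expect_mut, observed_mut)

    result <- data_frame %>%
      mutate(expect_mut = strsplit(as.character(expect_mut), ","),
             observed_mut = strsplit(as.character(observed_mut), ",")) %>%
      # Each row becomes its own group, regardless of the values in 'seqs'
      rowwise() %>%
      mutate(diff_mut = list(setdiff(observed_mut, expect_mut))) %>%
      ungroup()

    # Same element counts as the map2 and group_by outputs above
    lengths(result$diff_mut)
    ```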