Tags: r, dataframe, find, character, overlap

How to match string/character variables in a data table in R, then print into another column?


I have a data table with a specific set of genes in one column and a set of significant genes in another column. Both are character variables. How do I find the genes that appear in both columns of each row and write them into a new column?

Example:

a <- c('apple banana melon pear ', 'pear kiwi pineapple', 'avocado lime kiwi apple', 'lime pineapple banana melon')
b <- c('blah blah blah banana pear', 'blah pear blah blah kiwi', 'blah blah blah apple', 'lime blah blah blah')
df <- data.frame(a,b)

What I want is df$new_column equal to c('banana pear', 'pear kiwi', 'apple', 'lime')

I have tried:

df$new_column <- df$a[df$a %in% df$b]

but I am getting the error message:

Error in `$<-.data.frame`(`*tmp*`, new_column, value = character(0)) : 
  replacement has 0 rows, data has 4

Solution

  • Those strings have to be split into words first; then we can use intersect() on each pair of word sets.
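
    To see why the attempt in the question returns zero rows, here is a small check using the question's own vectors (`hits` is just an illustrative name). `%in%` asks whether each complete string in `a` equals some complete string in `b`; it never looks at the words inside them:

    ```r
    a <- c('apple banana melon pear ', 'pear kiwi pineapple', 'avocado lime kiwi apple', 'lime pineapple banana melon')
    b <- c('blah blah blah banana pear', 'blah pear blah blah kiwi', 'blah blah blah apple', 'lime blah blah blah')

    # element-wise membership test on whole strings, not on individual words
    hits <- a %in% b
    hits

    # no whole string of `a` occurs in `b`, so the subset is empty;
    # assigning that empty vector to a 4-row data frame raises the error
    a[hits]
    ```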

    With base R perhaps something like this:

    df <- data.frame(a,b)
    # split strings and find intersections, paste back together
    # (the \(a, b) lambda shorthand needs R >= 4.1; use function(a, b) on older versions)
    df$new_column <- mapply(\(a, b) paste(intersect(a, b), collapse = " "),
                            strsplit(df$a, " "),
                            strsplit(df$b, " "))
    df
    #>                             a                          b  new_column
    #> 1    apple banana melon pear  blah blah blah banana pear banana pear
    #> 2         pear kiwi pineapple   blah pear blah blah kiwi   pear kiwi
    #> 3     avocado lime kiwi apple       blah blah blah apple       apple
    #> 4 lime pineapple banana melon        lime blah blah blah        lime
    
    # all values are just plain strings:
    str(df)
    #> 'data.frame':    4 obs. of  3 variables:
    #>  $ a         : chr  "apple banana melon pear " "pear kiwi pineapple" "avocado lime kiwi apple" "lime pineapple banana melon"
    #>  $ b         : chr  "blah blah blah banana pear" "blah pear blah blah kiwi" "blah blah blah apple" "lime blah blah blah"
    #>  $ new_column: chr  "banana pear" "pear kiwi" "apple" "lime"
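
    If the real data might contain doubled or leading/trailing spaces (the first string in `a` already ends with one), a slightly more defensive variant could look like the sketch below. `shared_words` is a hypothetical helper name, not a base R function, and the shortened input vectors are made up for illustration:

    ```r
    # Hypothetical helper: words shared by two strings, in the order they appear in `x`
    shared_words <- function(x, y) {
      # trimws() drops leading/trailing blanks; "\\s+" splits on runs of whitespace
      x_words <- strsplit(trimws(x), "\\s+")[[1]]
      y_words <- strsplit(trimws(y), "\\s+")[[1]]
      paste(intersect(x_words, y_words), collapse = " ")
    }

    # illustrative input with a doubled space and a trailing space
    a <- c("apple banana  melon pear ", "pear kiwi pineapple")
    b <- c("blah banana pear", "blah pear blah kiwi")

    # mapply() walks both vectors in parallel; USE.NAMES = FALSE keeps the result unnamed
    overlap <- mapply(shared_words, a, b, USE.NAMES = FALSE)
    overlap
    ```

    Splitting on `\\s+` after `trimws()` avoids empty "words", which would otherwise count as a match whenever both strings happened to produce one.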
    

    Alternatively:

    library(dplyr, warn.conflicts = FALSE)
    library(stringr)
    library(purrr)
    
    # with Tidyverse and list columns:
    df_lc <- df %>% mutate(across(c(a,b), ~ str_split(.x, " "))) %>% 
      mutate(new_col = map2(a,b, ~ intersect(.x,.y)))
    
    # now we have list columns:
    df_lc["new_col"]
    #>        new_col
    #> 1 banana, pear
    #> 2   pear, kiwi
    #> 3        apple
    #> 4         lime
    
    # when printing a tibble it's a bit more evident:
    as_tibble(df_lc)
    #> # A tibble: 4 × 4
    #>   a         b         new_column  new_col  
    #>   <list>    <list>    <chr>       <list>   
    #> 1 <chr [5]> <chr [5]> banana pear <chr [2]>
    #> 2 <chr [3]> <chr [5]> pear kiwi   <chr [2]>
    #> 3 <chr [4]> <chr [4]> apple       <chr [1]>
    #> 4 <chr [4]> <chr [4]> lime        <chr [1]>
    
    str(df_lc)
    #> 'data.frame':    4 obs. of  4 variables:
    #>  $ a         :List of 4
    #>   ..$ : chr  "apple" "banana" "melon" "pear" ...
    #>   ..$ : chr  "pear" "kiwi" "pineapple"
    #>   ..$ : chr  "avocado" "lime" "kiwi" "apple"
    #>   ..$ : chr  "lime" "pineapple" "banana" "melon"
    #>  $ b         :List of 4
    #>   ..$ : chr  "blah" "blah" "blah" "banana" ...
    #>   ..$ : chr  "blah" "pear" "blah" "blah" ...
    #>   ..$ : chr  "blah" "blah" "blah" "apple"
    #>   ..$ : chr  "lime" "blah" "blah" "blah"
    #>  $ new_column: chr  "banana pear" "pear kiwi" "apple" "lime"
    #>  $ new_col   :List of 4
    #>   ..$ : chr  "banana" "pear"
    #>   ..$ : chr  "pear" "kiwi"
    #>   ..$ : chr "apple"
    #>   ..$ : chr "lime"
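
    If a plain character column is needed after all (for example to write the table to CSV), the list column can be collapsed again. A minimal base R sketch, using a hand-built stand-in for `new_col` so that it runs on its own; `id`, `new_col_chr` and `n_shared` are illustrative names:

    ```r
    # stand-in for a list column of intersections, including one empty result
    df_lc <- data.frame(id = 1:3)
    df_lc$new_col <- list(c("banana", "pear"), "apple", character(0))

    # vapply() enforces exactly one string per row; an empty intersection becomes ""
    df_lc$new_col_chr <- vapply(df_lc$new_col, paste, character(1), collapse = " ")

    # lengths() counts the shared words per row without collapsing anything
    df_lc$n_shared <- lengths(df_lc$new_col)
    df_lc[c("new_col_chr", "n_shared")]
    ```

    Within the pipeline above, `purrr::map_chr(new_col, paste, collapse = " ")` inside `mutate()` should do the same job.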
    

    Input:

    a <- c('apple banana melon pear ', 'pear kiwi pineapple', 'avocado lime kiwi apple', 'lime pineapple banana melon')
    b <- c('blah blah blah banana pear', 'blah pear blah blah kiwi', 'blah blah blah apple', 'lime blah blah blah')
    

    Created on 2023-01-20 with reprex v2.0.2