Search code examples
rdplyrfuzzyjoinanti-joinsemi-join

How to semi_join two dataframes by string column with one being colon-separated


I have two dataframes, dfa and dfb:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
  id = c(6:10)
)

which look like this:

> dfa
  gene_name id
1     MUC16  1
2      MUC2  2
3       MET  3
4      FAT1  4
5      TERT  5

> dfb
  gene_name id
1      MUC1  6
2 MET; BLEP  7
3     MUC21  8
4       FAT  9
5      TERT 10

dfa is my genes of interest list: I want to keep the dfb rows where they appear, minding the digits (MUC1 is not MUC16). My new_df should look like this:

> new_df
  gene_name id
1 MET; BLEP  7
2      TERT 10

My problem is that the regular dplyr::semi_join() does exact matches, which doesn't take into account the fact that dfb$gene_names can contain genes separated with "; ". Meaning that with this example, "MET" is not retained.

I tried to look into fuzzyjoin::regex_semi_join, but I can't make it do what I want...

A tidyverse solution would be welcome. (Maybe with stringr?!)

EDIT: Follow-up question...

How would I go about to do the reciprocal anti_join? Simply changing semi_join to anti_join in this method doesn't work because the row MET; BLEP is present when it shouldn't be...

Adding a filter(gene_name == new_col) after the anti_join works with the provided simple dataset, but if I twist it a little like this:

dfa <- data.frame(
  gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
  id = c(1:5)
)

dfb <- data.frame(
  gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
  id = c(6:10)
)

...then it doesn't anymore. Here and in my real-life dataset, dfa doesn't contain semicolons, it's only one column of individual gene names. But dfb contains a lot of information, and multiple combinations of semicolons...


Solution

  • I think I finally managed to make fuzzyjoin::regex_joins do what I want. It was ridiculously simple, I just had to tweak my dfa filter list:

    library(fuzzyjoin)
    
    # add "\b" regex expression before/after each gene of the list to filtrate from
    # (to search for whole words)
    dfa$gene_name <- paste0("\\b", dfa$gene_name, "\\b")
    
    # to keep genes from dfb that are present in the dfa filter list
    dfb %>% 
      regex_semi_join(dfa, by = c(gene_name = "gene_name"))
    
    # to exclude genes from dfb that are present in the dfa filter blacklist
    dfb %>% 
      regex_anti_join(dfa, by = c(gene_name = "gene_name"))
    

    One drawback though: it's quite slow...