I have two dataframes, dfa
and dfb
:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21", "FAT", "TERT"),
id = c(6:10)
)
which look like this:
> dfa
gene_name id
1 MUC16 1
2 MUC2 2
3 MET 3
4 FAT1 4
5 TERT 5
> dfb
gene_name id
1 MUC1 6
2 MET; BLEP 7
3 MUC21 8
4 FAT 9
5 TERT 10
dfa
is my genes of interest list: I want to keep the dfb
rows where they appear, minding the digits (MUC1
is not MUC16
). My new_df
should look like this:
> new_df
gene_name id
1 MET; BLEP 7
2 TERT 10
My problem is that the regular dplyr::semi_join()
does exact matches, which doesn't take into account the fact that dfb$gene_names
can contain genes separated with "; "
. Meaning that with this example, "MET"
is not retained.
I tried to look into fuzzyjoin::regex_semi_join
, but I can't make it do what I want...
A tidyverse solution would be welcome. (Maybe with stringr
?!)
EDIT: Follow-up question...
How would I go about to do the reciprocal anti_join
? Simply changing semi_join
to anti_join
in this method doesn't work because the row MET; BLEP
is present when it shouldn't be...
Adding a filter(gene_name == new_col)
after the anti_join
works with the provided simple dataset, but if I twist it a little like this:
dfa <- data.frame(
gene_name = c("MUC16", "MUC2", "MET", "FAT1", "TERT"),
id = c(1:5)
)
dfb <- data.frame(
gene_name = c("MUC1", "MET; BLEP", "MUC21; BLOUB", "FAT", "TERT"),
id = c(6:10)
)
...then it doesn't anymore. Here and in my real-life dataset, dfa
doesn't contain semicolons, it's only one column of individual gene names. But dfb
contains a lot of information, and multiple combinations of semicolons...
I think I finally managed to make fuzzyjoin::regex_joins
do what I want. It was ridiculously simple, I just had to tweak my dfa
filter list:
library(fuzzyjoin)
# add "\b" regex expression before/after each gene of the list to filtrate from
# (to search for whole words)
dfa$gene_name <- paste0("\\b", dfa$gene_name, "\\b")
# to keep genes from dfb that are present in the dfa filter list
dfb %>%
regex_semi_join(dfa, by = c(gene_name = "gene_name"))
# to exclude genes from dfb that are present in the dfa filter blacklist
dfb %>%
regex_anti_join(dfa, by = c(gene_name = "gene_name"))
One drawback though: it's quite slow...