Search code examples
rmergerbind

Merge dataframes rows by shared pattern


  1. I want merge 2 dataframes based on a shared pattern.

  2. The pattern is the ID name (here in bold): ID=HAND2;ACS=20 as "ID=(.+);ACS"

  3. If the ID is a match in both dataframes, then combine the respective rows!

DF1                   DF2                  MERGED ( DF2 + DF1 )

col1  col2      col1  col2                 col1  col2             col3  col4
HAND2 H2        OFS   ID=GATA5;ACS=45      OFS   ID=GATA5;ACS=45
HAND6 H6        FAM   ID=HAND2;ACS=20      FAM   ID=HAND2;ACS=20  HAND2   H2

In this example (HAND2) ID is matched, then, DF1 and DF2 matched rows are combined/merged.


Script tried

MERGED <- merge(data.frame(DF1, row.names=NULL), data.frame(DF2, row.names=NULL), by = ("ID=(.+);ACS"), all = TRUE)[-1]

error

Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column

I am struggling in finding a similar command, where in alternative to column-names, I can instead match dataframes rows by a shared pattern.

Thank you in advance for your help.


Solution

  • You may try fuzzyjoin. In the match_fun argument you can define a function for your specific needs.

    In your case gsub is extracting the pattern of the DF2 col2 variable. And with str_detect the extraction is compared to the col1 column of DF1.

    Data

    DF1 <- read.table(text = "col1  col2 
    HAND2 H2  
    HAND6 H6", header = T) 
    
    DF2 <- read.table(text = "col1  col2
    OFS   ID=GATA5;ACS=45
    FAM   ID=HAND2;ACS=20", header = T)
    

    Code

    library(fuzzyjoin)
    library(stringr)
    
    DF2 %>% 
      fuzzy_left_join(DF1, 
                      by = c("col2"= "col1"),
                      match_fun = function(x,y) str_detect(y, gsub("ID=(.+);(.*)", "\\1", x)) )
    
    

    Output

      col1.x          col2.x col1.y col2.y
    1    OFS ID=GATA5;ACS=45   <NA>   <NA>
    2    FAM ID=HAND2;ACS=20  HAND2     H2