Search code examples
rstringstream

How to compare comma separated string in one column with the comma separated strings other dataframe


I have two df as below df1:

    M1     |
    -------+
    a,b,c  |
    a      |
    b,c    |
    c,b,a  |
    b,a,d  |
    d,a,b,c|
    a,d,c  |
    b      |
    c,d    |
    d,a    |

df2:

X1      |X2
--------+---
a       |1
b       |2
c       |3
d       |4
a,b     |5
a,c     |6
a,d     |7
b,c     |8
b,d     |9
c,d     |10
a,b,c   |11
a,c,d   |12
a,b,d   |13
b,c,d   |14
a,b,c,d |15

can someone help me to match values in df1$M1 and df2$X1. and put the corresponding X2 value in column M2 as below

df1:

M1      |M2
--------+---
a,b,c   |11
a       |1
b,c     |8
c,b,a   |11
b,a,d   |13
d,a,b,c |15
a,d,c   |12
b       |2
c,d     |10
d,a     |7

Can someone help me


Solution

  • X1 and M1 have to be stored as Characters. You can check with str(df1), and re-assign if necessary df1 <- as.character(df1$X1), and the same for df2

    Then, create new columns with the values in alphabetical order:

    df1$Ordered <- sapply(lapply(strsplit(df1$X1, ","), sort),paste,collapse=",")

    df2$Ordered <- sapply(lapply(strsplit(df2$M1, ","), sort),paste,collapse=",")

    Then perform a join like so:

    merge(df1, df2, by="Ordered")

    If you want to include all the values in df1 regardless of whether they have a matching value in df2, add the all.x = TRUE argument. Same logic applies adding all = TRUE (include everything from both data frames), or all.y = TRUE for df2.