
How to check the similarity of row elements between data frames


I have multiple data frames, each with two columns of interest: ID, which identifies the sample in each row, and membership, which denotes the cluster that sample belongs to.

Across data frames, the same sample ID refers to the same sample, but the membership number is arbitrary. For example, samples A and B may have membership 1 in df1 and membership 2 in df2, but the conclusion is the same: A and B stay in the same cluster across df1 and df2.

Now I want to compare data frames of unequal length and find out how consistently sample A stays in the same cluster as sample B.

#----- dummy data

df1 <- data.frame(ID = paste0("S", 1:25), membership = rep(1:5, 5))
df2 <- data.frame(ID = paste0("S", 1:30), membership = rep(0:4, 6))

As you can see in the dummy data, S1, S6, S11, S16 and S21 stay together in df1 (left), and they also stay together in df2 (right), although the numeric value of their membership differs and the cluster sizes differ.

I can check this visually with an alluvial plot for a couple of data sets, but with a few dozen data sets like this, I want a quantitative measure of how well samples are preserved together across data sets in terms of their membership.

A simple output could be: S1, S6, S11, S16 and S21 are 100% together between df1 and df2 in this example. And if only S1, S6, S11 and S16 were in the same cluster in df2, then cluster 1 from df1 would be only 80% (4/5) intact when checked in df2.

So the process would be: group the samples by membership in the shorter df (say a group has 5 samples), then count how many of them remain together in the other df (say 4). The similarity is then 4/5.

If there are more appropriate ways to do this, please enlighten me. Thank you for pointers.

[Alluvial plot: df1 (left), df2 (right)]


Solution

  • In both data frames, we could split ID by membership into a list of vectors:

    idlist1 <- with(df1, split(ID, membership))
    idlist2 <- with(df2, split(ID, membership))
    

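    Then create a named vector that maps each membership label in df2 (the names) to the corresponding label in df1 (the values), use it to put the elements of idlist2 in the same order as idlist1, and pass both lists to Map to get the IDs shared by each pair of corresponding clusters:

    # names = membership labels in df2, values = matching labels in df1
    nm1 <- setNames(as.character(1:5), 0:4)
    # reorder idlist2 to line up with idlist1, then intersect pairwise
    Map(intersect, idlist1, idlist2[names(nm1)[match(names(idlist1), nm1)]])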
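To turn this into the kind of percentage the question asks for, one possible sketch is to cross-tabulate the two memberships and, for each cluster in df1, take the share of its samples that land in a single cluster of df2. This avoids writing the label mapping by hand. Note that cluster_preservation and df2b are hypothetical names introduced here for illustration, and the sketch assumes that only samples present in both data frames should count and that "staying together" means falling into the best-matching cluster of the other data frame.

```r
# Dummy data from the question
df1 <- data.frame(ID = paste0("S", 1:25), membership = rep(1:5, 5))
df2 <- data.frame(ID = paste0("S", 1:30), membership = rep(0:4, 6))

# Hypothetical helper: for each cluster of `a`, the fraction of its samples
# that fall into one single (best-matching) cluster of `b`.
cluster_preservation <- function(a, b) {
  # keep only samples present in both data frames (an assumption)
  m <- merge(a, b, by = "ID", suffixes = c(".a", ".b"))
  # rows = clusters of a, columns = clusters of b
  tab <- table(m$membership.a, m$membership.b)
  # largest overlap per row, divided by the row's cluster size
  apply(tab, 1, max) / rowSums(tab)
}

pres <- cluster_preservation(df1, df2)
pres

# Variant of df2 in which S21 is moved out of the cluster it shared
# with S1, S6, S11 and S16
df2b <- df2
df2b$membership[df2b$ID == "S21"] <- 3
pres2 <- cluster_preservation(df1, df2b)
pres2
```

With the original dummy data every cluster of df1 comes out fully preserved. In the df2b variant, cluster 1 of df1 drops to 4/5 while the other clusters are unaffected. The same helper can be applied to each pair of data frames when there are many of them.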