Search code examples
rdataframematchpairwise

Matching values from two column pairs in different data frames in R


I have two data frames which are edge lists containing columns "source" and "target" as their first two columns and the second data frame includes a third column with edge attributes. The two data frames are not of the same length and I want (1) to retrieve the edges from one data frame that are not in the other and (2) to get the values from the second data frame for matching edges.

Example:

> A <- data.frame(source=c("v1", "v1", "v2", "v2"), target=c("v2", "v4", "v3", "v4"))
> B <- data.frame(source=c("V1", "V2", "v1", "V4", "V4", "V5"), target=c("V2", "V5", "V3", "V3", "V2", "V4"), variable=c(3,4,0,2,1,0))
> A
  source target
1     v1     v2
2     v1     v4
3     v2     v3
4     v2     v4
> B
  source target variable
1     V1     V2        3
2     V2     V5        4
3     v1     V3        0
4     V4     V3        2
5     V4     V2        1
6     V5     V4        0

desirable outcome (1):

  source target
1     V2     V5
2     V1     V3
3     V4     V3
4     V5     V4

desirable outcome (2):

  source target variable
1     V1     V2        3
2     V2     V4        1

How can this be achieved with R?


Solution

  • The first you will get with an anti_join, though you will need to anti-join on both combinations of source and target since direction appears not to matter in your example. Note I've had to use toupper because the capitalization in your example was erratic and the example suggested case should be ignored.

    library(dplyr)
    
    anti_join(anti_join(B, A %>% mutate_all(toupper), 
                        by = c("source", "target")),
              A %>% mutate_all(toupper), 
              by = c(target = "source", source = "target")) %>%
      select(-variable)
    #>   source target
    #> 1     V2     V5
    #> 2     v1     V3
    #> 3     V4     V3
    #> 4     V5     V4
    

    The second result you can get from binding two inner_joins:

    bind_rows(inner_join(B, A %>% mutate_all(toupper), 
                         by = c("source", "target")), 
              inner_join(B, A %>% mutate_all(toupper), 
                         by = c(source = "target", target = "source")))
    #>   source target variable
    #> 1     V1     V2        3
    #> 2     V4     V2        1