Search code examples
rdataframemany-to-manyleft-join

Joining Dataframes with Many to Many Relationship in R


I'm trying to join two dataframes in R, each of which has multiple repetitions of the same ID number. This is to be expected within the data and is not a data quality issue.

It was my understanding that I should do a left join between the two dataframes, specify the many to many relationship, and this would work. However, my IDs get duplicated in a way I don't expect.

Here is some test data:

test4 <- data.frame(ID = c('A', 'A', 'B', 'B', 'C'), Score = c(1,2,3,4,5))
test5 <- data.frame(ID = c('A', 'A', 'B', 'B', 'C'), Race = c('W','W','B','B','W'))
test6 <- test4 %>% left_join(test5, by = "ID",  relationship = "many-to-many")

This outputs the data like so:

  ID Score Race
   A     1    W
   A     1    W
   A     2    W
   A     2    W
   B     3    B
   B     3    B
   B     4    B
   B     4    B
   C     5    W

When I want it to output like so:

  ID Score Race
   A     1    W
   A     2    W
   B     3    B
   B     4    B
   C     5    W

I'm willing to bet that I'm missing something simple, but I just can't figure out what. Any advice much appreciated!


Solution

  • you can use distinct before applying left_join, e.g.,

    test4 %>%
        left_join(distinct(test5), by = "ID")
    

    or specify multiple = "first" in left_join

    test4 %>%
      left_join(test5, by = "ID", multiple = "first")
    

    which gives

      ID Score Race
    1 A  1     W   
    2 A  2     W   
    3 B  3     B   
    4 B  4     B   
    5 C  5     W