Search code examples
rdataframefor-loopmergeoverlap

R merge dataframes only where multiple columns overlap


I have 2 large dataframes (+-300k rows both). Dataframe A contains information about genes. I want to add information to this dataframe from dataframe B, if certain values overlap. If a row in df.B exists where df.B$X1 = df.A$X1 & df.B$X2 = df.A$X2, then I want the value of df.B$X3 to be added to a new column in df.A (df.A$X3), specifically at the right row. Basically I want to merge df.A and df.B based on column X1 and X2. I started off with for loops, but there must be an easier way... Hope I explained my question good enough.

Thanks!

> df.A
  X1 X2
1  1  1
2  1  2
3  1  3
4  2  1
5  2  2
6  2  3

> df.B
  X1 X2  X3
1  1  1 0.5
2  1  8 0.3
3  1  9 0.2
4  2  2 0.4
5  2  9 0.3
6  2 10 0.2

Desired outcome:

> df.A
  X1 X2 X3
1  1  1 0.5
2  1  2 NA
3  1  3 NA
4  2  1 NA
5  2  2 0.4
6  2  3 NA

Solution

  • You want a left join

    dplyr

    library(dplyr)
    df.A %>% left_join(df.B, by = c("X1","X2"))
    

    data.table

    library(data.table)
    merge(setDT(df.A), setDT(df.B), all.x = TRUE, by = c("X1","X2"))