Tags: r, join, dplyr, inner-join, semi-join

semi_join in R but pull back duplicates


I'm having issues with semi_join from dplyr. I would like to do a semi join of dfA against dfB. Both dfA and dfB contain duplicate values of x. I want to pull back every row of dfA whose x has at least one match in dfB, including rows that share the same x in dfA.

dfA              dfB               >>     dfC
x    y    z      x    g                   x    y    z   
1    r    5      1    lkm                 1    r    5
1    b    4      1    pok                 1    b    4
2    4    e      2    jij                 2    4    e
3    5    r      2    pop                 3    5    r
3    9    g      3    hhg                 3    9    g
4    3    0      5    trt

What I would like to get is the dfC output above. Whenever an x value in dfA has AT LEAST 1 match in dfB, every row of dfA with that x should be pulled back. The row with x = 4 is dropped because dfB has no x = 4.

semi_join(dfA, dfB, by = "x")
dfC
x    y    z  
1    r    5
2    4    e
3    5    r


inner_join(dfA, dfB, by = "x")
x    y    z    g  
1    r    5    lkm
1    r    5    pok
1    b    4    lkm
1    b    4    pok
2    4    e    jij
2    4    e    pop
3    5    r    hhg
3    9    g    hhg

Neither of these gives me the right result. Any help would be great! Thanks in advance.
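To see where the eight rows of the inner join come from, here is a minimal sketch in base R. It uses merge() as a stand-in for inner_join (a swap, so the snippet needs no packages), and the column types are an assumption since the tables above don't state them. Each x value contributes (rows in dfA) times (rows in dfB) to the result.

```r
# Tables from the question, rebuilt by hand.
# Assumption: x is numeric, y/z/g are character.
dfA <- data.frame(
  x = c(1, 1, 2, 3, 3, 4),
  y = c("r", "b", "4", "5", "9", "3"),
  z = c("5", "4", "e", "r", "g", "0"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  x = c(1, 1, 2, 2, 3, 5),
  g = c("lkm", "pok", "jij", "pop", "hhg", "trt"),
  stringsAsFactors = FALSE
)

# merge() does an inner join by default (all = FALSE).
joined <- merge(dfA, dfB, by = "x")

# Rows per key in each table; their product is the rows per key in the join.
nA <- table(dfA$x)
nB <- table(dfB$x)
common <- intersect(names(nA), names(nB))
expected <- as.vector(nA[common] * nB[common])
names(expected) <- common

print(expected)      # per-key row counts in the join
print(nrow(joined))  # total rows in the join
```

With this data the per-key counts are 4, 2 and 2, which adds up to the eight rows shown in the inner_join output above.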


Solution

  • Not sure why you need a join: just use %in%, which keeps every row of dfA whose x appears anywhere in dfB$x.

    library(data.table)
    # setDT() converts dfA to a data.table by reference, then filters on x
    setDT(dfA)[x %in% dfB$x, ]
    
    # simple base R approach:
    dfA[dfA$x %in% dfB$x, ]
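For a quick check, here is a minimal sketch that rebuilds the question's tables by hand and applies the base R filter. The column types are my assumption (x numeric, the rest character), since the question doesn't state them.

```r
# Tables from the question, rebuilt by hand.
# Assumption: x is numeric, y/z/g are character.
dfA <- data.frame(
  x = c(1, 1, 2, 3, 3, 4),
  y = c("r", "b", "4", "5", "9", "3"),
  z = c("5", "4", "e", "r", "g", "0"),
  stringsAsFactors = FALSE
)
dfB <- data.frame(
  x = c(1, 1, 2, 2, 3, 5),
  g = c("lkm", "pok", "jij", "pop", "hhg", "trt"),
  stringsAsFactors = FALSE
)

# Keep every row of dfA whose x occurs at least once in dfB.
# Duplicated x values in dfA are kept; duplicates in dfB add no extra rows.
dfC <- dfA[dfA$x %in% dfB$x, ]
print(dfC)
```

This should give the five rows of the desired dfC (x = 1, 1, 2, 3, 3) with only the columns of dfA. As an aside, dplyr's documentation describes semi_join() as returning all rows of x that have a match in y, so it may be worth re-running the semi_join from the question on a current dplyr version. I haven't verified that here.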