Search code examples
rdplyr

Is there a _merge indicator available after a merge?


Is there a way to get the equivalent of a _merge indicator variable after a merge in dplyr?

Something similar to Pandas' indicator = True option that essentially tells you how the merge went (how many matches from each dataset, etc).

Here is an example in Pandas

import pandas as pd

df1 = pd.DataFrame({'key1' : ['a','b','c'], 'v1' : [1,2,3]})
df2 = pd.DataFrame({'key1' : ['a','b','d'], 'v2' : [4,5,6]})

match = df1.merge(df2, how = 'left', indicator = True)

Here, after a left join between df1 and df2, you want to immediately know how many rows in df1 found a match in df2 and how many of them did not

match
Out[53]: 
  key1  v1   v2     _merge
0    a   1  4.0       both
1    b   2  5.0       both
2    c   3  NaN  left_only

and I can tabulate this merge variable:

match._merge.value_counts()
Out[52]: 
both          2
left_only     1
right_only    0
Name: _merge, dtype: int64

I don't see any option available after a, say, left join in dplyr

key1 = c('a','b','c')
v1 = c(1,2,3)
key2 = c('a','b','d')
v2 = c(4,5,6)
df1 = data.frame(key1,v1)
df2 = data.frame(key2,v2)

> left_join(df1,df2, by = c('key1' = 'key2'))
  key1 v1 v2
1    a  1  4
2    b  2  5
3    c  3 NA

Am I missing something here? Thanks!


Solution

  • We create the 'merge' column based on inner_join, anti_join and then bind the rows with bind_rows

    d1 <- inner_join(df1, df2, by = c('key1' = 'key2')) %>%
                        mutate(merge = "both")  
    bind_rows(d1, anti_join(df1, df2, by = c('key1' = 'key2')) %>% 
                 mutate(merge = 'left_only'))