Search code examples
rjoininner-join

How to do conditional merge


I have 2 data frames, and I'm trying to do an inner join using a condition. I'll show an example to clarify what I'm trying to do:

A
  cnumero cep numero bairro
1  124,35 124     35      K
2  375,NA 375     NA      L
3   NA,28  NA     28      M

B

  cnumero bairro XY
1  124,35      J  1
2  375,48      L  2
3  135,28      M  3

The first condition to merge is, if A$cep != NA and A$numero != NA, then the join is made using the column cnumero, otherwise the join is made using the column bairro, resulting in:

new_A
  cnumero cep numero XY
1  124,35 124     35  1
2  375,NA 375     NA  2
3   NA,28  NA     28  3

What I have done so far is to make the inner join using this: A[A$cnumero %in% unique(B$cnumero),], since in my real dataframe I have repeated values in data frame B.

Edit: my data example

A = data.frame(cnumero=c("124,35", "375,NA", "NA,28"),cep = c(124, 375, NA), numero = c(35, NA, 28), bairro =  c("K", "L","M"))
B = data.frame(cnumero=c("124,35", "375,48", "135,28"), bairro =  c("J", "L","M"), XY = c(1, 2, 3))
new_A = data.frame(cnumero=c("124,35", "375,NA", "NA,28"),cep = c(124, 375, NA), numero = c(35, NA, 28), XY = c(1, 2, 3))

Solution

  • What about a solution in base R in this way, doing two steps, first the join for the first condition, then for the second, last put results together:

    # the join with the first condition
    A_1 <-  merge(A[!is.na(A$cep) & !grepl('NA',A$cnumero), ],B, by = 'cnumero')
    
    # select the column you need
    A_1 <- A_1[,c("cnumero", "cep","numero","XY")]
    
    # join for the second condition
    A_2 <-  merge(A[is.na(A$cep) | grepl('NA',A$cnumero), ],B, by = 'bairro')
    
    # select columns you need
    A_2 <- A_2[,c("cnumero.x", "cep","numero","XY")]
    
    # rename the second part's columns
    colnames(A_2) <- colnames(A_1)
    
    # now the result 
    new_A <- rbind(A_1,A_2)
    new_A
      cnumero cep numero XY
    1  124,35 124     35  1
    2  375,NA 375     NA  2
    3   NA,28  NA     28  3
    
    # in case you need to remove the "temporary" tables
    # remove(A_1, A_2)