Search code examples
rdplyrinner-join

Prevent power_inner_join from joining columns containing NA, empty and 0 values


Here's my dataframes for reproducing the problem:

df1 <- data.frame(id=c("","","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420"),
                  text=c("asdf","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten"),
                  response=c("y","y","y","n","n","y","y","n","n","y"))

df2 <- data.frame(id=c("","","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420","636809222","2004722036","2004894388","2005045755","2005535472","2005630542","2005788781","2005809679","2005838317","2005866692"),
                  text=c("asdf_xyz","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten","tiger_xyz","lion","leopard","ostrich","kangaroo","platypus","fish","reptile","mammals","amphibians_xyz"),
                  volume=c("1234","432","324","333","2223","412346","7456","3456","2345","2345","6","345","23","2","4778","234","8675","3459","8","9"))

What I have tried:

joined_df1 <- power_full_join(df1, df2, by = c("id"),
                             conflict = rw ~ ifelse(.x != .y,
                                                    NA_integer_, 
                                                    .x), na_matches = "never")

which gives

           id response volume     text
1        <NA>        y   <NA>     <NA>
2        <NA>        y   <NA>     <NA>
3   634703802        y    324      dog
4   634927873        n    333    mouse
5   635812953        n   2223 elephant
6   636004739        y 412346    goose
7   636101211        y   7456      rat
8   636157799        n   3456     mice
9   636263106        n   2345    kitty
10  636752420        y   2345   kitten
11       <NA>     <NA>   1234     <NA>
12       <NA>     <NA>    432     <NA>
13  636809222     <NA>      6     <NA>
14 2004722036     <NA>    345     <NA>
15 2004894388     <NA>     23     <NA>
16 2005045755     <NA>      2     <NA>
17 2005535472     <NA>   4778     <NA>
18 2005630542     <NA>    234     <NA>
19 2005788781     <NA>   8675     <NA>
20 2005809679     <NA>   3459     <NA>
21 2005838317     <NA>      8     <NA>
22 2005866692     <NA>      9     <NA>

but I do not want to join for if NA is present and the expected output I want is:

           id response volume     text
3   634703802        y    324      dog
4   634927873        n    333    mouse
5   635812953        n   2223 elephant
6   636004739        y 412346    goose
7   636101211        y   7456      rat
8   636157799        n   3456     mice
9   636263106        n   2345    kitty
10  636752420        y   2345   kitten
13  636809222     <NA>      6     <NA>
14 2004722036     <NA>    345     <NA>
15 2004894388     <NA>     23     <NA>
16 2005045755     <NA>      2     <NA>
17 2005535472     <NA>   4778     <NA>
18 2005630542     <NA>    234     <NA>
19 2005788781     <NA>   8675     <NA>
20 2005809679     <NA>   3459     <NA>
21 2005838317     <NA>      8     <NA>
22 2005866692     <NA>      9     <NA>

don't worry about the left numbering. Can someone help pls? Thanks.


Solution

  • We could convert the blanks ("") to NA, filter all NA elements from 'id' in both datasets and then do the power_full_join

    library(dplyr)
    library(powerjoin)
    df1 %>% 
      na_if("") %>% 
      filter(complete.cases(id)) %>% 
      power_full_join(df2 %>% 
                       na_if("") %>%
                 filter(complete.cases(id)), by = c("id"),   
                                conflict = rw ~ ifelse(.x != .y,  
                   NA_integer_,  .x), na_matches = "never")