Search code examples
rfilterinner-joinmismatchanti-join

How to return row values that match column 'id' in both df1 and df2 but not column 'text' and return NA to the mismatch in column 'text'?


Below are my two dataframes, df1 and df2

df1 <- data.frame(id=c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420"),text=c("asdf","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten"),response=c("y","y","y","n","n","y","y","n","n","y"))

id     text response
1  632592651     asdf        y
2  633322173      cat        y
3  634703802      dog        y
4  634927873    mouse        n
5  635812953 elephant        n
6  636004739    goose        y
7  636101211      rat        y
8  636157799     mice        n
9  636263106    kitty        n
10 636752420   kitten        y

df2 <- data.frame(id=c("632592651","633322173","634703802","634927873","635812953","636004739","636101211","636157799","636263106","636752420","636809222","2004722036","2004894388","2005045755","2005535472","2005630542","2005788781","2005809679","2005838317","2005866692"),
                  text=c("asdf_xyz","cat","dog","mouse","elephant","goose","rat","mice","kitty","kitten","tiger_xyz","lion","leopard","ostrich","kangaroo","platypus","fish","reptile","mammals","amphibians_xyz"),
                  volume=c("1234","432","324","333","2223","412346","7456","3456","2345","2345","6","345","23","2","4778","234","8675","3459","8","9"))

 id           text volume
1   632592651       asdf_xyz   1234
2   633322173            cat    432
3   634703802            dog    324
4   634927873          mouse    333
5   635812953       elephant   2223
6   636004739          goose 412346
7   636101211            rat   7456
8   636157799           mice   3456
9   636263106          kitty   2345
10  636752420         kitten   2345
11  636809222      tiger_xyz      6
12 2004722036           lion    345
13 2004894388        leopard     23
14 2005045755        ostrich      2
15 2005535472       kangaroo   4778
16 2005630542       platypus    234
17 2005788781           fish   8675
18 2005809679        reptile   3459
19 2005838317        mammals      8
20 2005866692 amphibians_xyz      9

I would like to create a dataframe, df3, that return row values that match column 'id' and 'text' in both df1 and df2 but return NA to the mismatch in column 'text'

Desired output of df3:

 id   text response volume
1  632592651       NA        y   1234
2  633322173      cat        y    432
3  634703802      dog        y    324
4  634927873    mouse        n    333
5  635812953 elephant        n   2223
6  636004739    goose        y 412346
7  636101211      rat        y   7456
8  636157799     mice        n   3456
9  636263106    kitty        n   2345
10 636752420   kitten        y   2345

What I have tried:

df3 <- df1 %>%
  anti_join(df2, by = c("id","text"))

I then change all non zero value in column 'text' to NA

df3$text[df3$text != 0] <- NA

Did inner_join between df1 and df2 to get df4 (match both "id" and "text")

df4 <- df1 %>% 
    inner_join(df2, by = c("id","text"))

And inner_join df5 with df3 to add "volume" column

df5 <- df3 %>%
  inner_join(df2, by= c("id")) %>%
  select(id,text.x,response,volume)

Change column name 'text.x' to 'text'

colnames(df5)[colnames(df5)%in%c("text.x")] <- c("text")

And finally binding the df4 and df5 to get the desired output:

df6 <- rbind(df5,df4)

id     text response volume
1  632592651     <NA>        y   1234
2  633322173      cat        y    432
3  634703802      dog        y    324
4  634927873    mouse        n    333
5  635812953 elephant        n   2223
6  636004739    goose        y 412346
7  636101211      rat        y   7456
8  636157799     mice        n   3456
9  636263106    kitty        n   2345
10 636752420   kitten        y   2345

It's getting extremely difficult to deal with if I have many columns. Can someone advise how to simplify this method? Thanking in advance.


Solution

  • Using dplyr and only joining by "id" you can simplify it like this:

    library(dplyr)
    
    inner_join(x = df1, 
               y = df2, 
               by = "id") %>%
      mutate_if(is.factor, as.character) %>%
      mutate(text = ifelse(test = text.x != text.y, 
                           yes = NA, 
                           no = text.x)) %>%
      select(id, text, response, volume)
    
    #>           id     text response volume
    #> 1  632592651     <NA>        y   1234
    #> 2  633322173      cat        y    432
    #> 3  634703802      dog        y    324
    #> 4  634927873    mouse        n    333
    #> 5  635812953 elephant        n   2223
    #> 6  636004739    goose        y 412346
    #> 7  636101211      rat        y   7456
    #> 8  636157799     mice        n   3456
    #> 9  636263106    kitty        n   2345
    #> 10 636752420   kitten        y   2345