Search code examples
rdataframefuzzyjoin

Merge 2 data frames by the columns that do not match exactly


I have 2 data frames. I am trying to merge/join them together while specifying how I want rows to align. Mock data below.

df <- data.frame(Race = c("White", "NHPI", "AA"),
                 Cases = c(20, 4, 7)
                 )
df
> df
   Race Cases
1 White    20
2  NHPI     4
3    AA     7

df2 <- data.frame(Race2 = c("African American", "White", "Hawaiian or Pacific Islander"),
                  TotalPopulation = c(200, 400, 50))
df2
> df2
                         Race2 TotalPopulation
1             African American             200
2                        White             400
3 Hawaiian or Pacific Islander              50

What I would like to do is merge the data while specifying which row matches with the other dataset- for example, I want the merge to align "African American" and "AA". Ideal output would look like this-

   Race Cases                        Race2 TotalPopulation
1    AA     7             African American             200
2 White    20                        White             400
3  NHPI     4 Hawaiian or Pacific Islander              50

I am slightly newer to coding, so if you could walk through the code that would be great. Thanks so much.


Solution

  • Based on @M-- comment, you can do:

    df %>%
      left_join(df2 %>% 
                   mutate(Race = case_when(
                     Race2 == "African American" ~ "AA",
                     Race2 == "Hawaiian or Pacific Islander" ~ "NHPI",
                     .default = Race2
                   )))
    
       Race Cases                        Race2 TotalPopulation
    1 White    20                        White             400
    2  NHPI     4 Hawaiian or Pacific Islander              50
    3    AA     7             African American             200
    

    If you can set full name and then extract abbreviations you can do it faster, however, note you have NHPI and Hawaiian or Pacific Islander and there is not match. If you can include N in Hawaiian or Pacific Islander, then you can use:

    df %>% 
      left_join(df2 %>% 
                  mutate(Race = gsub("[:a-z:]*\\s*", "", Race2),
                         Race = sub("W", "White", Race)))
    

    This work under the assumption of the df2 is as follows:

    df2 <- data.frame(Race2 = c("African American", "White", "Naaaa Hawaiian or Pacific Islander"),
                      TotalPopulation = c(200, 400, 50))