Search code examples
rdataframemergematchpartial

How to merge two data frames with specific string match in columns in R?


I have two dataframes data1 and data2 which have information like below:

dput(data1)

structure(list(ProfName = c("Hua (Christine) Xin", "Dereck Barr-Pulliam", 
"Lisa M. Blum", "Russell  Williamson", "William D. Stout", "Michael F. Wade", 
"Sheila A.  Johnston", "Julie Huang", "Alan Attaway", "Alan Levitan", 
"Benjamin P. Foster", "Carolyn M.  Callahan"), Title = c(" PhD", 
" PhD", " LLM", " PhD", " PhD", " CPA", " MS", " PhD", " PhD", 
" PhD", " PhD", " PhD"), Profession = c("Assistant Professor", 
"Assistant Professor", "Instructor", "Assistant Professor", "Associate Professor and Director", 
"Instructor", "Instructor", "Associate Professor", "Professor", 
"Professor", "Professor", "Brown-Forman Professor of Accountancy"
)), row.names = c(8L, 18L, 25L, 36L, 49L, 50L, 56L, 69L, 71L, 
82L, 88L, 89L), class = "data.frame")

It looks like below:

enter image description here

dput(data2)

structure(list(ProfName = c("Blandford, K     ", "Okafor, A     ", 
"Johnston, S     ", "Rolen, R     ", "Attaway, A     ", "Xin, H     ", 
"Huang, Y     ", "Stout, W     ", "Williamson, R     ", "Callahan, C     ", 
"Foster, B     ", "Blum, L     ", "Levitan, A     ", "Barr-Pulliam, D     ", 
"Wade, M     ")), row.names = c(NA, -15L), class = "data.frame")

data2 looks like below:

enter image description here

I wanted to merge two dataframes, but the names look different. Only a specific string is matching between two dataframes with column ProfName. The data should be merged and if the names don't have any information it should be empty. If they don't have any information in the columns Title and Profession, both ProfName and New columns should have the same name.

I tried using merge, but it didn't give the desired output.

merge(data1, data2, by="ProfName", all.x=TRUE, all.y = TRUE)

The output should look like below:

enter image description here


Solution

  • Here's a simple solution:

    library(stringr)
    library(dplyr)
    library(tidyr)
    library(magrittr)
    
    data1 %<>% mutate(lname = str_extract(ProfName, "[A-Za-z\\-]+$"))
    data2 %<>% mutate(lname = str_extract(ProfName, "^[A-Za-z\\-]+"))
    
    df <- merge(data1, data2, all.y = TRUE, by = "lname")
    
    head(df)
    
    #          lname           ProfName.x Title                            Profession           # ProfName.y
    # 1      Attaway         Alan Attaway   PhD                             Professor      Attaway, A     
    # 2 Barr-Pulliam  Dereck Barr-Pulliam   PhD                   Assistant Professor Barr-Pulliam, D     
    # 3    Blandford                 <NA>  <NA>                                  <NA>    Blandford, K     
    # 4         Blum         Lisa M. Blum   LLM                            Instructor         Blum, L     
    # 5     Callahan Carolyn M.  Callahan   PhD Brown-Forman Professor of Accountancy     Callahan, C     
    # 6       Foster   Benjamin P. Foster   PhD                             Professor       Foster, B