Search code examples
rstringjoindata-cleaningstringdist

joining on inexact strings in R


I am looking to join two tables.. however the data I am looking to join on does not match exactly.. joining on NFL player names..

data sets below..

> dput(att75a)
structure(list(rusher_player_name = c("A.Ekeler", "A.Jones", 
"A.Kamara", "A.Mattison", "A.Peterson", "B.Hill"), mean_epa = c(-0.110459963350783, 
0.0334332018597805, -0.119488111742492, -0.155261835310445, -0.123485646124451, 
-0.0689611296359916), success_rate = c(0.357664233576642, 0.40495867768595, 
0.401129943502825, 0.283018867924528, 0.322727272727273, 0.35
), plays = c(137L, 242L, 177L, 106L, 220L, 80L)), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -6L))

> dput(rb2019capa)
structure(list(rusher_player_name = c("Aaron Jones", "Adrian Peterson", 
"Alexander Mattison", "Alvin Kamara", "Austin Ekeler", "Brian Hill"
), Team = c("Packers", "Redskins", "Vikings", "Saints", "Chargers", 
"Falcons"), `Salary Cap Value` = c(695487, 1780000, 700545, 1050693, 
646668, 645000), `Cash Spent` = c(645000, 2530000, 1317180, 807500, 
645000, 645000)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-6L))

for example I am trying to join A.Mattison on Alexander Mattison.. and so on..

i experimented with stringdist and fuzzyjoin but could not solve my problem..

please consider... took the head() of each dataset to condense per question asking guidelines.. original data sets have lengths of 51 obs. and 168 obs... will that affect how the join is performed?

What is the best way to go about cleaning these names?

thank you for your time..


Solution

  • Use sub to replace first name with first initial.

    library(dplyr)
    
    rb2019capa %>%
      mutate(rusher_player_name=
             sub("^([A-Z])\\S+\\s([A-Za-z].*)$", "\\1.\\2", rusher_player_name)) %>%
      inner_join(att75a, by="rusher_player_name") # or left_join (up to you)
    

    # A tibble: 6 x 7
      rusher_player_name Team     `Salary Cap Value` `Cash Spent` mean_epa success_rate plays
      <chr>              <chr>                 <dbl>        <dbl>    <dbl>        <dbl> <int>
    1 A.Jones            Packers              695487       645000   0.0334        0.405   242
    2 A.Peterson         Redskins            1780000      2530000  -0.123         0.323   220
    3 A.Mattison         Vikings              700545      1317180  -0.155         0.283   106
    4 A.Kamara           Saints              1050693       807500  -0.119         0.401   177
    5 A.Ekeler           Chargers             646668       645000  -0.110         0.358   137
    6 B.Hill             Falcons              645000       645000  -0.0690        0.35     80