Search code examples
rstringdplyrmergestringdist

How do I lock the first digits of the 'by' column in a stringdist join?


I am trying to use stringdist_join to merge two tables. I have built my 'by' variable as the concatenation of three variables which are named as such:

UAI : a serial number nom : surname prenom : name

The code below works well, however I'd like to have a perfect match on the UAI part which is always the first 8 characters of the variable UAInomprenom. How can I do that?

stringdist_join(Ech_final_nom, BSA_affect_nom, 
                by = "UAInomprenom",
                mode = "left",
                ignore_case = FALSE, 
                method = "jw", 
                max_dist = 0.1117, 
                distance_col = "dist") 

Thank you for your help!


Solution

  • I am taking the following two datasets as an example:

    df1 <- structure(list(V1 = c("abcNum1Num1Num1Num1", "abc1Num1Num1Num1Num", 
    "accArv", "accbrf"), V2 = c(1L, 4L, 5L, 2L)), class = "data.frame", row.names = c(NA, 
    -4L))
    
    df2 <- structure(list(V1 = c("abcNun1Nun1Nun1Nun1", "abc1Nun1Nun1Nun1Nun", 
    "accArv", "accNun1Nun1Nun1Nun1"), V2 = c(2L, 5L, 4L, 1L)), class = "data.frame", row.names = c(NA, 
    -4L))
    

    In these two dataframes, the variable V1 is the join by field, in which the 3 first characters are not fuzzy (in your case, there are 8 not fuzzy characters).

    Now, separate the column V1 to have an isolated column with the referred 3 first characters:

    library(fuzzyjoin)
    library(tidyverse)
    
    df1 <- df1 %>% 
      extract(V1, into = c("V1A","V1B"), "(.{3})(.*)")
    
    df2 <- df2 %>% 
      extract(V1, into = c("V1A","V1B"), "(.{3})(.*)")
    

    Finally, apply the fuzzy join and remove the rows where the values of the two columns with the 3-character field are different:

    stringdist_join(df1, df2, 
                    by = "V1B",
                    mode = "left",
                    ignore_case = FALSE, 
                    method = "jw", 
                    max_dist = 0.5) %>% 
      filter(V1A.x == V1A.y) %>% 
      unite("V1",c("V1A.x","V1B.x"),sep="") %>% 
      select(V1,V2=V2.x,V3=V2.y)