Search code examples
rjoinmatching

Matching people with similar names: R


I am working in R

My data has 500,000 rows in, but small example is used here.

I have some data for staff working in schools. Some people work in one school, some two, some three etc. Each school they work in is a row of data.

Schools do not always record the first names the same for an individual. e.g. one school records as Will, another William.

I also have this assumption: for an individual working in more than one school, their second name and date of birth is always recorded the same in each school.

Based on the similarity of their first name, I want a way of identifying the people who are likely to the same person, and then allocate them an id.

There will be some sort of cut off e.g. greg and griffin are likely to not be the same person, even though they share the same first 2 letters.

sample data:

data_current <- data.frame(first_name = c("will", "william", "william", "laura", "jessica", "jessicalouise", "james", "greg", "griffin"), 
                           last_name = c("smith", "smith", "smith", "maxwell", "maxwell", "maxwell", "lead", "jones", "jones"),
                           date_of_birth = c("2000-01-02","2000-01-02", "2000-01-02", "2007-01-02","2007-01-02","2007-01-02","1999-01-02","2004-01-02","2004-01-02"), 
                           school_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9))
first_name second_name date_of_birth school_id
will smith 2000-01-02 1
william smith 2000-01-02 2
william smith 2000-01-02 3
laura maxwell 2007-01-02 4
jessica maxwell 2007-01-02 5
jessicalouise maxwell 2007-01-02 6
james lead 1999-01-02 7
greg jones 2004-01-02 8
griffin jones 2004-01-02 9

desired data:

It is likely that the first three people are the same person, so are allocated the same person_id, and so on...

data_desired <- data.frame(first_name = c("will", "william", "william", "laura", "jessica", "jessicalouise", "james", "greg", "griffin"), 
                           last_name = c("smith", "smith", "smith", "maxwell", "maxwell", "maxwell", "lead", "jones", "jones"),
                           date_of_birth = c("2000-01-02","2000-01-02", "2000-01-02", "2007-01-02","2007-01-02","2007-01-02","1999-01-02","2004-01-02","2004-01-02"), 
                           school_id = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
                           person_id = c(1, 1, 1, 2, 3, 3, 4, 5, 6))
first_name second_name date_of_birth school_id person_id
will smith 2000-01-02 1 1
william smith 2000-01-02 2 1
william smith 2000-01-02 3 1
laura maxwell 2007-01-02 4 2
jessica maxwell 2007-01-02 5 3
jessicalouise maxwell 2007-01-02 6 3
james lead 1999-01-02 7 4
greg jones 2004-01-02 8 5
griffin jones 2004-01-02 9 6

Does anyone have an suggestions of how to solve this?


Solution

  • Here is one possibility using string similarity. It works well in your example because you have complete separation in your data. However, over 500K rows you may run into some issues with this, but it could get you started:

    library(stringdist)
    library(dplyr)
    library(purrr)
    
    group_first <- function(dat, cutoff = 0.5) {
      if (n_distinct(dat) == 1) return(1)
    
      M <- stringsimmatrix(dat$first_name, dat$first_name, method = "cosine", q = 2)
      l <- unique(lapply(seq_len(nrow(M)), \(x) dat$first_name[M[x, ] > cutoff]))
      
      person_id <- map(dat$first_name, ~ which(grepl(.x, l, fixed = T))) |>
        modify_if(~ length(.x) > 1, ~ 0)
      
      return(unlist(person_id))
    }
    
    data_current |>
      mutate(person_id = group_first(pick(everything())), .by = c(date_of_birth, last_name)) |>
      # review output before running next mutate
      mutate(person_id = cur_group_id(), .by = c(date_of_birth, last_name, person_id))
    

    Output

         first_name last_name date_of_birth school_id person_id
    1          will     smith    2000-01-02         1         1
    2       william     smith    2000-01-02         2         1
    3       william     smith    2000-01-02         3         1
    4         laura   maxwell    2007-01-02         4         2
    5       jessica   maxwell    2007-01-02         5         3
    6 jessicalouise   maxwell    2007-01-02         6         3
    7         james      lead    1999-01-02         7         4
    8          greg     jones    2004-01-02         8         5
    9       griffin     jones    2004-01-02         9         6
    

    How it works

    1. Take each last name and date of birth group and compute a bigram character cosine similarity score. If this similarity score is above 0.5 (by default) then first names are grouped together.
    2. It is possible that one first name might be in more than one first name grouping, if so then a zero is returned and these special cases will need further review. For example, "jess" could match to "jessica" and "jesse", but "jesse" might not necessarily match to "jessica". At this point the IDs are unique within last name and date of birth, but not overall.
    3. After review in step #2 (this will be something you will need to do with a finer algorithm or manually) the IDs are then recalculated so they are distinct across individuals.

    As mentioned in the comments this can be tricky for names that have similar bigrams but are not the same. These cases will be hard to distinguish. Also, since this is just a lexical comparison it likely will not work well for shortened names. For example, "Bob" short for "Robert" or "Dick" short for "Richard".

    You may also consider using other string similarity measures. For example, you can specify the Jaro–Winkler similarity (method = 'jw') with a weight (p = 0.10). This will weight first names that start the same ("will" versus "william"), resulting in a higher similarity score and potential separation.