Search code examples
rstringduplicatesdata-cleaning

Cleaning large numbers of manually entered human names in R


I have been instructed to group a dataset of sales by salesperson name for work.

However, to reduce IT administration costs, the company does not create a separate ID or login for each salesperson, instead they have asked them to enter it manually in an app they are using. As you can imagine, this results in lots of duplicates due to data entry errors.

The dataset looks something like this (not their real names):

reprex_days <- seq(1,22,1)
reprex_sales <- rnorm(22, mean=100, sd=50)
reprex_names<- c("Abby Davies", "Abby Davies", "Abby Davies", "Abby  Davis", "Abi Davies", "abby davies", NA,
                 "simon", "Simon Jenkins", "Simon Jenkins", "Simon Jenkins", "Simon   Jenkins", "Simon jenjins",
                 "Toby Jones", "Toby Jones", "toby jones", "toby jones", "toby jonse", "toby", NA, NA, NA)

reprex <- data.frame(cbind(reprex_days, reprex_sales, reprex_names))
names(reprex) <-c("day_ID", "sales", "salesperson")
reprex$sales <- as.numeric(reprex$sales)

If I try to group by salesperson this obviously duplicates because of the multiple versions of each salesperson's name present in the data:

reprex_grouped <- reprex %>%
  group_by(salesperson) %>%
  summarise(total_sales = sum(sales),
            mean_sales = mean(sales),
            days_active = n())

reprex_grouped

gives:

   # A tibble: 13 x 4
   salesperson     total_sales mean_sales days_active
   <fct>                 <dbl>      <dbl>       <int>
 1 Abby  Davis              12       12             1
 2 abby davies              14       14             1
 3 Abby Davies              31       10.3           3
 4 Abi Davies                3        3             1
 5 simon                    20       20             1
 6 Simon   Jenkins           5        5             1
 7 Simon jenjins            13       13             1
 8 Simon Jenkins            46       15.3           3
 9 toby                     18       18             1
10 toby jones               25       12.5           2
11 Toby Jones               17        8.5           2
12 toby jonse                6        6             1
13 NA                       43       10.8           4

I'm planning to use str_remove() and toLower() to remove all the extra spaces and account for case differences, and I know that I'm not ever going to be able to successfully identify staff members when they have only entered their first names (sometimes there are multiple people with the same first name) or have failed to enter anything.

However I'm wondering if there's some way of automatically identifying where the name is one or two characters different from another entry and changing it to whichever comes first in the dataset (for this purpose it doesn't matter whether their name is spelled correctly as much as that as many as possible of their sales are attributed to a single person)? (e.g. "Toby Jonse" c.f. "Toby Jones", "Simon Jenjins" c.f. "Simon Jenkins", "Abby Davis" c.f. "Abby Davies")

Does anyone know of an advanced string pattern recognition package that can do something like this?

I can't do it manually because there are too many names, and because the company want to reuse the code for future sales activities with different staff. (I can't change the identification system; IT don't have the budget to hire someone to manage database permissions so they can't implement a traditional login system.)


Solution

  • It's not perfect, however, a nice option could be the phonetic() function from stringdist library:

    reprex %>%
     group_by(ID = phonetic(salesperson)) %>%
     mutate(salesperson2 = first(salesperson))
    
       day_ID sales salesperson   ID    salesperson2 
       <fct>  <dbl> <fct>         <chr> <fct>        
     1 1         10 Abby Davies   A131  Abby Davies  
     2 2         13 Abby Davies   A131  Abby Davies  
     3 3         14 Abby Davies   A131  Abby Davies  
     4 4         19 Abby  Davis   A131  Abby Davies  
     5 5         22 Abi Davies    A131  Abby Davies  
     6 6          7 abby davies   A131  Abby Davies  
     7 7         18 <NA>          <NA>  <NA>         
     8 8          2 simon         S550  simon        
     9 9          1 Simon Jenkins S552  Simon Jenkins
    10 10        11 Simon Jenkins S552  Simon Jenkins