
Data Cleaning in R: remove test customer names


I am working with customer data that contains each customer's first and last name, and I want to remove names that are just random keystrokes. Test accounts are mixed into the data set and have junk names. For example, in the data below I want to remove customers 2, 5, 9, 10, 12, and so on. I would appreciate your help.

CustomerId  FirstName      LastName
1           MARY           MEYER
2           GFRTYUIO       UHBVYY
3           CHARLES        BEAL
4           MARNI          MONTANEZ
5           GDTDTTD        DTTHDTHTHTHD
6           TIFFANY        BAYLESS
7           CATHRYN        JONES
8           TINA           CUNNINGHAM
9           FGCYFCGCGFC    FGCGFCHGHG
10          ADDHJSDLG      DHGAHG
11          WALTER         FINN
12          GFCTFCGCFGC    CG GFCGFCGFCGF
13          ASDASDASD      AASDASDASD
14          TYKTYKYTKTY    YTKTYKTYK
15          HFHFHF         HAVE
16          REBECCA        CROSSWHITE
17          GHSGHG         HGASGH
18          JESSICA        TREMBLEY
19          GFRTYUIO       UHBVYY
20          HUBHGBUHBUH    YTVYVFYVYFFV
21          HEATHER        WYRICK
22          JASON          SPLICHAL
23          RUSTY          OWENS
24          DUSTIN         WILLIAMS
25          GFCGFCFGCGFC   GRCGFXFGDGF
26          QWQWQW         QWQWWW
27          LIWNDVLIHWDV   LIAENVLIHEAV
28          DARLENE        SHORTRIDGE
29          BETH           HDHDHDH
30          ROBERT         SHIELDS
31          GHERDHBXFH     DFHFDHDFH
32          ACE            TESSSSSRT
33          ALLISON        AWTREY
34          UYGUGVHGVGHVG  HGHGVUYYU
35          HCJHV          FHJSEFHSIEHF

Solution

  • You can compute a "variability strength" for each full name (FirstName and LastName combined): the number of unique characters in the full name divided by its total number of characters. Then remove the names with a low variability strength. Junk names typed as random keystrokes tend to repeat the same few keys, so their variability strength is low.

    I did this with the charToRaw function, because it is very fast, together with the dplyr library, as below:

    # Building Test Data
    df <- data.frame(
      CustomerId = c(1, 2, 3, 4, 5, 6, 7),
      FirstName = c("MARY", "FGCYFCGCGFC", "GFCTFCGCFGC", "ASDASDASD", "GDTDTTD", "WALTER", "GFCTFCGCFGC"),
      LastName = c("MEYER", "FGCGFCHGHG", "GFCGFCGFCGF", "AASDASDASD", "DTTHDTHTHTHD", "FINN", "CG GFCGFCGFCGF"),
      stringsAsFactors = FALSE
    )
    
    
    #test data: df
    #   CustomerId    FirstName         LastName
    #1         1           MARY            MEYER
    #2         2    FGCYFCGCGFC       FGCGFCHGHG
    #3         3    GFCTFCGCFGC      GFCGFCGFCGF
    #4         4      ASDASDASD       AASDASDASD
    #5         5        GDTDTTD     DTTHDTHTHTHD
    #6         6         WALTER             FINN
    #7         7    GFCTFCGCFGC   CG GFCGFCGFCGF
    
    library(dplyr)
    df %>%
      # Combine FirstName and LastName (spaces inside LastName are removed first)
      mutate(FullName = paste(FirstName, gsub(" ", "", LastName, fixed = TRUE))) %>%
      # charToRaw() takes a single string, so evaluate each distinct full name on its own
      group_by(FullName) %>%
      # Variability strength = unique characters / total characters
      # (the space that paste() inserts between the two names is counted too)
      mutate(Variability = length(unique(as.integer(charToRaw(FullName)))) / nchar(FullName)) %>%
      # Keep names with variability strength >= 0.40 (adjust the threshold as needed)
      filter(Variability >= 0.40)
    
    
    # A tibble: 2 x 5
    # Groups:   FullName [2]
    # CustomerId FirstName LastName    FullName   Variability
    #  <dbl>     <chr>      <chr>        <chr>        <dbl>
    #1   1        MARY      MEYER     MARY MEYER    0.6000000
    #2   6      WALTER      FINN     WALTER FINN    0.9090909
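
If you would rather not depend on dplyr, the same ratio can be sketched in base R. This is only a minimal sketch, not the code from the answer above. The helper name `variability` is hypothetical, it splits names with `strsplit` instead of `charToRaw` (which should give the same result for plain ASCII names), and it assumes the names are non-empty and non-missing. Because the helper is vectorized, it does not need grouping and handles repeated names (such as customers 2 and 19 in the question) without special treatment.

```r
# Hypothetical helper (not part of the original answer):
# ratio of distinct characters to total characters for each string in x.
variability <- function(x) {
  vapply(x, function(s) {
    chars <- strsplit(s, "", fixed = TRUE)[[1]]  # split into single characters
    length(unique(chars)) / length(chars)
  }, numeric(1), USE.NAMES = FALSE)
}

# Small illustrative sample; rows 4 and 13 carry the same junk name
df <- data.frame(
  CustomerId = c(1, 4, 6, 13),
  FirstName  = c("MARY", "ASDASDASD", "WALTER", "ASDASDASD"),
  LastName   = c("MEYER", "AASDASDASD", "FINN", "AASDASDASD"),
  stringsAsFactors = FALSE
)

# Same full-name construction as above: drop spaces inside LastName, then paste
full_name <- paste(df$FirstName, gsub(" ", "", df$LastName, fixed = TRUE))
df$Variability <- variability(full_name)

# Keep rows at or above the 0.40 threshold (an adjustable assumption)
clean <- df[df$Variability >= 0.40, ]
print(clean)
```

With this sample, only customers 1 and 6 should remain. The threshold still needs tuning against your real data, since a ratio like this is a heuristic and some junk names use many different keys.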