r, tidyverse, data-cleaning

Is there an R function to clean messy salaries in character format?


I have a column of messy salary data. I am wondering if there is a package that has a function made specifically for cleaning this type of messy data. My data looks like:

data.frame(salary = c("40,000-60,000", "40-80K", "$100,000", 
                  "$70/hr", "Between $65-80/hour", "$100k",
                  "50-60,000 a year", "90"))
#>                salary
#> 1       40,000-60,000
#> 2              40-80K
#> 3            $100,000
#> 4              $70/hr
#> 5 Between $65-80/hour
#> 6               $100k
#> 7    50-60,000 a year
#> 8                  90

Created on 2020-12-16 by the reprex package (v0.3.0)

and I would like the clean column to be numeric, expressed as an annual salary. I know how to clean this column manually; I'm just wondering whether any packages can help (other than readr::parse_number()).

The expected output would look like:

#>   output
#> 1  50000
#> 2  60000
#> 3 100000
#> 4 145600
#> 5 150800
#> 6 100000
#> 7  55000
#> 8  90000

Solution

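  • Here are some first steps you can try. I define two functions. rem_k replaces a k or K that follows a digit with ",000". add_zero appends ",000" to the lower bound of a range when only the upper bound is written in thousands (for example, "50-60,000" becomes "50,000-60,000"). Note that its pattern only matches a lower bound ending in a nonzero digit followed by 0, such as 40 or 50.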

    rem_k <- function(x) {
      sub("(\\d)[kK]", "\\1,000", x)
    }
    
    add_zero <- function(x) {
      ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x), sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x), x)
    }
    

    Finally, I remove every character that is not a digit, a comma, or a hyphen:

    library(dplyr)

    # df is the data frame from the question
    df %>% 
      mutate(salary2 = gsub("[^0-9,\\-]", "", add_zero(rem_k(salary))))
    
                   salary       salary2
    1       40,000-60,000 40,000-60,000
    2              40-80K 40,000-80,000
    3            $100,000       100,000
    4              $70/hr            70
    5 Between $65-80/hour         65-80
    6               $100k       100,000
    7    50-60,000 a year 50,000-60,000
    8                  90            90
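
The steps above stop at a cleaned string, not the numeric annual figure the question asks for. The remaining step could be sketched as follows, in base R. This is only one possible approach under stated assumptions: an hourly rate is annualized with 2,080 hours per year (40 hours x 52 weeks), which is consistent with the expected output in the question (70 x 2080 = 145600), and a non-hourly figure below 1,000 is treated as shorthand for thousands (so "90" becomes 90000). The name to_annual is a hypothetical helper, not a function from any package; rem_k and add_zero are copied from the answer.

```r
# Helpers copied from the answer above.
rem_k <- function(x) {
  sub("(\\d)[kK]", "\\1,000", x)
}

add_zero <- function(x) {
  ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x),
         sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x), x)
}

# Hypothetical helper: turn a messy salary string into an annual number.
# Assumptions: 2080 working hours per year; a non-hourly value below
# 1000 is shorthand for thousands.
to_annual <- function(salary, hours_per_year = 2080) {
  # Flag hourly rates before the "/hr" or "/hour" suffix is stripped.
  hourly <- grepl("/\\s*(hr|hour)", salary, ignore.case = TRUE)

  # Keep only digits, commas, and hyphens, then drop the commas.
  cleaned <- gsub("[^0-9,-]", "", add_zero(rem_k(salary)))
  cleaned <- gsub(",", "", cleaned)

  # Split ranges on the hyphen and take the midpoint of the bounds.
  parts <- strsplit(cleaned, "-", fixed = TRUE)
  mid <- vapply(parts, function(p) mean(as.numeric(p)), numeric(1))

  ifelse(hourly, mid * hours_per_year,
         ifelse(mid < 1000, mid * 1000, mid))
}

df <- data.frame(salary = c("40,000-60,000", "40-80K", "$100,000",
                            "$70/hr", "Between $65-80/hour", "$100k",
                            "50-60,000 a year", "90"))
df$output <- to_annual(df$salary)
```

Taking the midpoint of a range matches the expected output in the question (for example, "40,000-60,000" gives 50000). Inputs outside the eight patterns shown, such as monthly rates or other currencies, are not handled by this sketch.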