I have a column of messy salary data. I am wondering if there is a package that has a function made specifically for cleaning this type of messy data. My data looks like:
data.frame(salary = c("40,000-60,000", "40-80K", "$100,000",
                      "$70/hr", "Between $65-80/hour", "$100k",
                      "50-60,000 a year", "90"))
#> salary
#> 1 40,000-60,000
#> 2 40-80K
#> 3 $100,000
#> 4 $70/hr
#> 5 Between $65-80/hour
#> 6 $100k
#> 7 50-60,000 a year
#> 8 90
Created on 2020-12-16 by the reprex package (v0.3.0)
and I would like the cleaned column to be numeric, expressed as an annual salary. I know how to clean this column manually; I'm just wondering whether there are any packages that can help (other than readr::parse_number()).
The expected output would look like:
#> output
#> 1 50000
#> 2 60000
#> 3 100000
#> 4 145600
#> 5 150800
#> 6 100000
#> 7 55000
#> 8 90000
Here are some first steps you can try.
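I define two functions. The first, rem_k, replaces a k or K that follows a digit with ,000.
The second, add_zero, appends ,000 to the first number of a range when only the second number is written in thousands (for example, 50-60,000 becomes 50,000-60,000).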
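# Replace a trailing k/K after a digit with ",000" (first match only)
rem_k <- function(x) {
  sub("(\\d)[kK]", "\\1,000", x)
}
# If a range starts with a short number ending in 0 (e.g. "50-60,000"),
# append ",000" to that first number so both ends are in the same unit
add_zero <- function(x) {
  ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x),
         sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x),
         x)
}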
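As a quick check of what the two helpers do on their own, here is a small sketch that applies them to a few of the values from the question. The helper definitions are repeated so the snippet runs by itself; the vector name x is just a placeholder for this example.

```r
# Helpers repeated from above so this snippet is self-contained
rem_k <- function(x) {
  sub("(\\d)[kK]", "\\1,000", x)
}
add_zero <- function(x) {
  ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x),
         sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x),
         x)
}

# A few sample values taken from the question
x <- c("40-80K", "$100k", "50-60,000 a year", "Between $65-80/hour")

rem_k(x)
# "40-80,000" "$100,000" "50-60,000 a year" "Between $65-80/hour"

add_zero(rem_k(x))
# "40,000-80,000" "$100,000" "50,000-60,000 a year" "Between $65-80/hour"
```

Note that add_zero only fires when the first number of the range ends in 0, so a value such as "45-60,000" would pass through unchanged and would need a more general pattern.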
Finally, I remove every character that is not a digit, a comma, or a hyphen:
library(dplyr)
# df is the data frame from the question
df %>%
  mutate(salary2 = gsub("[^0-9,\\-]", "", add_zero(rem_k(salary))))
               salary       salary2
1       40,000-60,000 40,000-60,000
2              40-80K 40,000-80,000
3            $100,000       100,000
4              $70/hr            70
5 Between $65-80/hour         65-80
6               $100k       100,000
7    50-60,000 a year 50,000-60,000
8                  90            90
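From here, one possible way to reach the annual numeric column is sketched below in base R. This is only a rough sketch under several assumptions that are not stated in the question but are inferred from its expected output: rows containing "hr" or "hour" are hourly rates, a year has 2080 working hours (40 × 52, which gives 70 → 145600), ranges are replaced by their midpoint, and non-hourly amounts below 1000 are meant in thousands (90 → 90000). The function name to_annual and the argument hours_per_year are hypothetical.

```r
# Helpers repeated from above so this snippet is self-contained
rem_k <- function(x) {
  sub("(\\d)[kK]", "\\1,000", x)
}
add_zero <- function(x) {
  ifelse(grepl("[1-9]0\\-\\d[0,]{2,}", x),
         sub("([1-9]0)(\\-\\d[0,]{2,})", "\\1,000\\2", x),
         x)
}

# Hypothetical helper: convert messy salary strings to annual numbers.
# hours_per_year = 2080 is an assumption inferred from the expected output.
to_annual <- function(salary, hours_per_year = 2080) {
  # Flag hourly rates before the text is stripped away
  hourly <- grepl("hr|hour", salary, ignore.case = TRUE)

  # Same cleaning steps as above, then drop the thousands separators
  cleaned <- gsub("[^0-9,-]", "", add_zero(rem_k(salary)))
  cleaned <- gsub(",", "", cleaned, fixed = TRUE)

  # Split ranges on "-" and take the midpoint; single values stay as they are
  value <- vapply(strsplit(cleaned, "-", fixed = TRUE),
                  function(parts) mean(as.numeric(parts)),
                  numeric(1))

  # Assumption: a non-hourly amount below 1000 is written in thousands
  value <- ifelse(!hourly & value < 1000, value * 1000, value)

  # Scale hourly rates to a full year
  ifelse(hourly, value * hours_per_year, value)
}

salary <- c("40,000-60,000", "40-80K", "$100,000",
            "$70/hr", "Between $65-80/hour", "$100k",
            "50-60,000 a year", "90")

to_annual(salary)
# 50000 60000 100000 145600 150800 100000 55000 90000
```

The thresholds here are heuristics tuned to the eight sample rows, so they should be checked against the real data (for example, monthly or weekly amounts are not handled at all).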