Search code examples
rduplicatesdata-munging

Managing duplicates that are not entered as duplicates in R


I have a data set from a state agency and am trying to clean it up. One obstacle is that there are no input standards for titles (e.g., DIR, DIRECTOR, DIR., are all allowable inputs). Another obstacle is that an individual may have several job titles, but only 1 of them is entered.

For example, consider employee #1 below (Emp_1). This person is a faculty member. They teach year after year. But in 2015, in addition to teaching duties, they also picked up some work that would be classified as non-faculty, or exempt. But their ‘real’ classification is faculty. In this example, Emp_1 has been given two records. I’ve done a quick grep to create a new classification titled “job.cat” that helps identify faculty. But in the case of Emp_1 and Emp_4 below, you can see that they get miscategorized as ‘non-faculty’. Now, I could just do a grep for “STIPEND/COORD FAC,EXMT EMP” and count that as faculty too, but there are some people who never teach and are always “STIPEND/COORD FAC,EXMT EMP”, and so should be counted as non-faculty.

To solve this, I’m thinking I should create a second data set of names, job titles, and job categories, clean that up, and then merge it back to this data below where I’d join using employee_name and the correct job. cat. But I’m wondering if there is a better way to do this using an apply like function, an ifelse, or a for-loop.

d <- read.table(text = 'employee_name   job_title   Salary_2012 Salary_2013 Salary_2014 Salary_2015 job.cat
Emp_1   FACULTY 31200   37400   33300   NA  Faculty
Emp_1   "STIPEND/COORD FAC,EXMT EMP"  NA  NA  NA  37300   Non-fac
Emp_2   FACULTY 29300   28400   31800   NA Faculty
Emp_2   "PART TIME FACULTY"   NA  NA  NA  30800   Faculty
Emp_3   FACULTY NA  NA  4300    NA  Faculty
Emp_4   FACULTY 50000   59900   31300   NA  Faculty
Emp_4   "STIPEND/COORD FAC,EXMT EMP"  NA  NA  NA  22000   Non-fac', header = TRUE)

Solution

  • I think you can approach this by grouping results by employee_name, checking if that employee has any faculty job, then defining them as faculty:

    library(dplyr)
    d %>% 
      group_by(employee_name) %>% 
      mutate(job.cat = ifelse(any(grep("FACULTY",job_title)),"Faculty", "Non-fac"))
    

    Your other problem could have a similar solution:

    d %>% 
      group_by(employee_name) %>% 
      mutate(job_title = ifelse(any(grep("DIR",job_title)),"Director", job_title))