
In R, how do you find the number of different values between 2 character strings?


I'm trying to count the new employees a manager gained between time 1 and time 2. For each manager, I have a string of all the employee ids that roll up under that manager.

My code below always says there is 1 new employee, but as you can see, there are 2. How do I find out how many new employees there are? The ids aren't guaranteed to be in the same order, but they will always be separated by ", ".

library(dplyr)
library(stringr)

#First data set
mydata_q2 <- tibble(
  leader = 1,
  reports_q2 = "2222, 3333, 4444"
) 

#Second dataset
mydata_q3 <- tibble(
  leader = 1,
  reports_q3 = "2222, 3333, 4444, 55555, 66666" 
) 

#Function to count number of new employees
calculate_number_new_emps <- function(reports_time1, reports_time2) {
  time_1_reports <- ifelse(is.na(reports_time1), character(0), str_split(reports_time1, " ,\\s*")[[1]])
  time_2_reports <- str_split(reports_time2, " ,\\s*")[[1]]
  num_new_employees <- length(setdiff(time_1_reports, time_2_reports))
  num_new_employees
}

#Join data and count number of new staff--get wrong answer
mydata_q2 %>%
  left_join(mydata_q3) %>%
  mutate(new_staff_count = calculate_number_new_emps(reports_q2, reports_q3))

EDIT:

The output that I want is for new_staff_count = 2 for this example.

That's because there are 2 new employees (55555 and 66666) in q3 that weren't in q2.


Solution

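  • The ifelse() call is not doing what you expect. ifelse() is vectorized and returns a result the same length as its test, so with a single input string it keeps only the first element of the split. Use a plain if/else instead. The split pattern also needs to be ",\\s*" with no leading space, otherwise the strings are never split. Then calculate the difference between the two vector lengths.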

    calculate_number_new_emps <- function(reports_time1, reports_time2) {
       if (is.na(reports_time1)) {
          time_1_reports <- character(0)
       } else {
          time_1_reports <- str_split(reports_time1, ",\\s*")[[1]]
       }
       
       print(time_1_reports)
       time_2_reports <- str_split(reports_time2, ",\\s*")[[1]]
       num_new_employees <- length(time_2_reports) - length(time_1_reports)
       num_new_employees
    }
    
    #Join data and count number of new staff
    mydata_q2 %>%
       left_join(mydata_q3) %>%
       mutate(new_staff_count = calculate_number_new_emps(reports_q2, reports_q3))
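
To see why ifelse() caused trouble, here is a minimal sketch that compares it with a plain if/else on a single string. It uses base R's strsplit() in place of str_split() so that it runs without any packages; the variable names are only illustrative.

```r
# Base R only; strsplit() stands in for stringr::str_split() here.
ids <- "2222, 3333, 4444"

# ifelse() returns a value shaped like its test. The test has length 1,
# so only the first element of the split vector is kept.
via_ifelse <- ifelse(is.na(ids), character(0), strsplit(ids, ",\\s*")[[1]])

# A plain if/else evaluates one branch and returns that branch whole.
via_if <- if (is.na(ids)) character(0) else strsplit(ids, ",\\s*")[[1]]

print(via_ifelse)
print(via_if)
```

The first result holds a single id and the second holds all three, which is why the if/else form is the one to use when the branches return vectors of a different length than the test.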
    

    EDIT from Original Poster:

    Thank you, Dave! I was able to simplify. I also changed the calculation: subtracting the lengths gave negative counts when a manager had more reports at time 1 than at time 2, and it gave the wrong count when employees were swapped for different ones without the total changing.

    calculate_number_new_emps <- function(reports_time1, reports_time2) {
      time_1_reports <- str_split(reports_time1, ", ")[[1]]
      time_2_reports <- str_split(reports_time2, ", ")[[1]]
      length(setdiff(time_2_reports, time_1_reports))
    }
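
One caveat with the function above: the [[1]] means it only looks at the first row, so it fits a one-row tibble like the example. For a data frame with several managers, a row-by-row version might look like the sketch below. The name count_new_emps_vec is hypothetical and not from the original post, the sample ids are made up, and base R's strsplit() is used instead of str_split() so the block runs without packages.

```r
# Hypothetical helper: count_new_emps_vec() is not from the original post.
# Base R only; strsplit() stands in for stringr::str_split().
count_new_emps_vec <- function(reports_time1, reports_time2) {
  split_ids <- function(x) {
    # Split every string into ids; a missing roster becomes an empty set.
    lapply(strsplit(as.character(x), ",\\s*"), function(ids) ids[!is.na(ids)])
  }
  # For each row, count the time-2 ids that do not appear at time 1.
  mapply(
    function(t1, t2) length(setdiff(t2, t1)),
    split_ids(reports_time1),
    split_ids(reports_time2),
    USE.NAMES = FALSE
  )
}

# Made-up rosters: two added, one swapped, and one leader with no prior reports.
q2 <- c("2222, 3333, 4444", "7777, 8888", NA)
q3 <- c("2222, 3333, 4444, 55555, 66666", "7777, 9999", "1111")
new_counts <- count_new_emps_vec(q2, q3)
print(new_counts)
```

The second row shows the swap case mentioned above: both rosters have two ids, so subtracting lengths would report 0, while setdiff() reports the 1 id that is new. Inside the pipeline it would be called the same way as before, for example mutate(new_staff_count = count_new_emps_vec(reports_q2, reports_q3)).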