Search code examples
rdatasetdate-arithmetic

create dataset according to joint/left dates - R


I have the following dataset with information about employees joining and leaving an organisation:

dataset1 <- read.table(
  text = "
Employee       Organisation      Joint_date          Left_date
G223           A123              1993-05-15          2019-05-01
G223           A123              2020-04-11          NA
G233           A123              2018-02-20          NA
G234           A123              2015-09-04          NA
G111           A333              1980-10-03          2019-09-27
G122           A333              2000-11-16          NA
G177           A333              2005-01-19          NA
G330           A333              2002-12-24          NA
G556           A333              2018-05-01          2019-03-04
G555           A445              2015-11-18          NA
G556           A445              2005-09-01          2018-03-04
G557           A445              1989-04-05          NA",
  header = TRUE)

dataset1$Employee <- as.factor(dataset1$Employee)
dataset1$Organisation <- as.factor(dataset1$Organisation)
dataset1$Joint_date <- as.Date(dataset1$Joint_date, format="%Y-%m-%d")
dataset1$Left_date <- as.Date(dataset1$Left_date, format="%Y-%m-%d")

I have created dataset2 (monthly dataset) that goes from 2018-01-31 up to 2021-06-30:

dataset2_dates=c("2018-01-31","2018-02-28","2018-03-31","2018-04-30","2018-05-31","2018-06-30","2018-07-31","2018-08-31","2018-09-30","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-31","2019-04-30","2019-05-31","2019-06-30","2019-07-31","2019-08-31","2019-09-30","2019-10-31","2019-11-30","2019-12-31","2020-01-31","2020-02-29","2020-03-31","2020-04-30","2020-05-31","2020-06-30","2020-07-31","2020-08-31","2020-09-30","2020-10-31","2020-11-30","2020-12-31","2021-01-31","2021-02-28","2021-03-31","2021-04-30","2021-05-31","2021-06-30")

# add dates
dataset2 <- expand.grid(Organisation = unique(dataset1$Organisation),
                       Month = dataset2_dates)
  
## sort
dataset2 <- dataset2[order(dataset2$Organisation, dataset2$Month),]
## reset id
rownames(dataset2) <- NULL

dataset2$Organisation <- as.factor(dataset2$Organisation)
dataset2$Month <- as.Date(dataset2$Month, format="%Y-%m-%d")

I would like to end up with the following dataset3:

Organisation    | Month         | Nr_employees      
A123            | 2018-01-31    | 2
A123            | 2018-02-28    | 3
A123            | 2018-03-31    | 3
A123            | 2018-04-30    | 3
A123            | 2018-05-31    | 3
A123            | 2018-06-30    | 3
A123            | 2018-07-31    | 3
A123            | 2018-08-31    | 3
A123            | 2018-09-30    | 3
A123            | 2018-10-31    | 3
A123            | 2018-11-30    | 3
A123            | 2018-12-31    | 3
A123            | 2019-01-31    | 3
A123            | 2019-02-28    | 3
A123            | 2019-03-31    | 3
A123            | 2019-04-30    | 3
A123            | 2019-05-31    | 3
A123            | 2019-06-30    | 2
A123            | 2019-07-31    | 2
A123            | 2019-08-31    | 2
A123            | 2019-09-30    | 2
A123            | 2019-10-31    | 2
A123            | 2019-11-30    | 2
A123            | 2019-12-31    | 2
A123            | 2020-01-31    | 2
A123            | 2020-02-29    | 2
A123            | 2020-03-31    | 2
A123            | 2020-04-30    | 3
A123            | 2020-05-31    | 3
A123            | 2020-06-30    | 3
A123            | 2020-07-31    | 3
A123            | 2020-08-31    | 3
A123            | 2020-09-30    | 3
A123            | 2020-10-31    | 3
A123            | 2020-11-30    | 3
A123            | 2020-12-31    | 3
A123            | 2021-01-31    | 3
A123            | 2021-02-28    | 3
A123            | 2021-03-31    | 3
A123            | 2021-04-30    | 3
A123            | 2021-05-31    | 3
A123            | 2021-06-30    | 3
.....

Note: If an employee joins on the last day of the month or leaves on the first day of the month, it still counts as if the employee was there the whole month.

And dataset4 that summarises data from 2018-01-31 to 2021-06-30:

Organisation  | Average Nr_employees  | Nr_employees joined | Nr_employess left | Nr_employess stayed the whole time
A123          | 115/42 = 2.7          | 2                   | 1                 | 1
....

Any ideas on how to generate dataset3 and dataset4?


Solution

  • I prefer to work with the data.table package - for problems like creating dataset3, the non-equijoin functionality is a great fit.

    library(data.table)
    setDT(dataset1)    
    
    dataset2 <- CJ(Organisation = dataset1[,unique(Organisation)],
                   ## This is an option to generate the month sequence based on the first date in dataset1 to present
                   # Month = seq.Date(from = as.Date(cut.Date(dataset1[,min(Joint_date)], breaks = "months")),
                   #                  to = as.Date(cut.Date(Sys.Date(), breaks = "months")),
                   #                  by = "month") - 1
                   ## Otherwise you can still generate a full sequence of month-end dates with just a start and end
                   Month = seq.Date(from = as.Date("2018-02-01"),
                                    to = as.Date("2021-07-01"),
                                    by = "month") - 1)
    
    ## Simpler to compare month start dates than end
    dataset2[,MonthStart := as.Date(cut.Date(Month, breaks = "months"))]
    
    ## Fill NA's for Left_date with today's date to properly account for employees still present
    dataset1[,Left_date_fill := data.table::fcoalesce(Left_date, Sys.Date())]
    
    ## Create columnns with the month start dates of arrivals/departures
    dataset1[,Joint_date_month := as.Date(cut.Date(Joint_date, breaks = "months"))]
    dataset1[,Left_date_fill_month := as.Date(cut.Date(Left_date_fill, breaks = "months"))]
    
    ## Use a non-equijoin to summarize the number of employees present by month
    dataset2[dataset1, Nr_employees := .N, by = .(Organisation,
                                                  Month), on = .(Organisation = Organisation,
                                                                 MonthStart >= Joint_date_month,
                                                                 MonthStart <= Left_date_fill_month)]
    
    ## Using this method, the information required for `dataset3` has been added to `dataset2` instead
    print(dataset2[seq_len(5), .(Organisation, Month, Nr_employees)])
    #    Organisation      Month Nr_employees
    # 1:         A123 2018-01-31            2
    # 2:         A123 2018-02-28            3
    # 3:         A123 2018-03-31            3
    # 4:         A123 2018-04-30            3
    # 5:         A123 2018-05-31            3
    # 6:         A123 2018-06-30            3
    

    To create a summary table like dataset4, it makes the most sense to me to break up each of the steps into a separate operation:

    ## Start with a table of organizations for dataset4
    dataset4 <- data.table(Organisation = dataset1[,unique(Organisation)])
    
    ## Join on a summary of dataset2 to get the average over the window of interest
    dataset4[dataset2[,.(Avg = mean(fcoalesce(Nr_employees),0.0)), by = .(Organisation)]
             ,Average_Nr_employees := Avg, on = .(Organisation)]
    
    ## Join a summary of dataset1 counting the number that joined in the window of interest
    dataset4[dataset1[Joint_date_month >= dataset2[,min(MonthStart)]
                      & Joint_date_month <= dataset2[,max(MonthStart)]
                      , .(N = .N)
                      , by = .(Organisation)], Nr_employees_joined := N, on = .(Organisation)]
    
    ## Join a summary of dataset1 counting the number that left in the window of interest
    dataset4[dataset1[Left_date_fill_month >= dataset2[,min(MonthStart)]
                      & Left_date_fill_month <= dataset2[,max(MonthStart)]
                      , .(N = .N)
                      , by = .(Organisation)], Nr_employees_left := N, on = .(Organisation)]
    
    ## Join a summary of dataset1 counting the number that joined before and left after window of interest
    dataset4[dataset1[Joint_date_month <= dataset2[,min(MonthStart)]
                      & Left_date_fill_month >= dataset2[,max(MonthStart)]
                      , .(N = .N)
                      , by = .(Organisation)], Nr_employees_stayed := N, on = .(Organisation)]
    
    print(dataset4)
    #    Organisation Average_Nr_employees Nr_employees_joined Nr_employees_left Nr_employees_stayed
    # 1:         A123             2.738095                   2                 1                   1
    # 2:         A333             3.761905                   1                 2                   3
    # 3:         A445             2.071429                  NA                 1                   2