I have the following dataset with information about employees joining and leaving an organisation:
dataset1 <- read.table(
text = "
Employee Organisation Joint_date Left_date
G223 A123 1993-05-15 2019-05-01
G223 A123 2020-04-11 NA
G233 A123 2018-02-20 NA
G234 A123 2015-09-04 NA
G111 A333 1980-10-03 2019-09-27
G122 A333 2000-11-16 NA
G177 A333 2005-01-19 NA
G330 A333 2002-12-24 NA
G556 A333 2018-05-01 2019-03-04
G555 A445 2015-11-18 NA
G556 A445 2005-09-01 2018-03-04
G557 A445 1989-04-05 NA",
header = TRUE)
dataset1$Employee <- as.factor(dataset1$Employee)
dataset1$Organisation <- as.factor(dataset1$Organisation)
dataset1$Joint_date <- as.Date(dataset1$Joint_date, format="%Y-%m-%d")
dataset1$Left_date <- as.Date(dataset1$Left_date, format="%Y-%m-%d")
I have created dataset2 (monthly dataset) that goes from 2018-01-31 up to 2021-06-30:
dataset2_dates=c("2018-01-31","2018-02-28","2018-03-31","2018-04-30","2018-05-31","2018-06-30","2018-07-31","2018-08-31","2018-09-30","2018-10-31","2018-11-30","2018-12-31","2019-01-31","2019-02-28","2019-03-31","2019-04-30","2019-05-31","2019-06-30","2019-07-31","2019-08-31","2019-09-30","2019-10-31","2019-11-30","2019-12-31","2020-01-31","2020-02-29","2020-03-31","2020-04-30","2020-05-31","2020-06-30","2020-07-31","2020-08-31","2020-09-30","2020-10-31","2020-11-30","2020-12-31","2021-01-31","2021-02-28","2021-03-31","2021-04-30","2021-05-31","2021-06-30")
# add dates
dataset2 <- expand.grid(Organisation = unique(dataset1$Organisation),
Month = dataset2_dates)
## sort
dataset2 <- dataset2[order(dataset2$Organisation, dataset2$Month),]
## reset id
rownames(dataset2) <- NULL
dataset2$Organisation <- as.factor(dataset2$Organisation)
dataset2$Month <- as.Date(dataset2$Month, format="%Y-%m-%d")
I would like to end up with the following dataset3
:
Organisation | Month | Nr_employees
A123 | 2018-01-31 | 2
A123 | 2018-02-28 | 3
A123 | 2018-03-31 | 3
A123 | 2018-04-30 | 3
A123 | 2018-05-31 | 3
A123 | 2018-06-30 | 3
A123 | 2018-07-31 | 3
A123 | 2018-08-31 | 3
A123 | 2018-09-30 | 3
A123 | 2018-10-31 | 3
A123 | 2018-11-30 | 3
A123 | 2018-12-31 | 3
A123 | 2019-01-31 | 3
A123 | 2019-02-28 | 3
A123 | 2019-03-31 | 3
A123 | 2019-04-30 | 3
A123 | 2019-05-31 | 3
A123 | 2019-06-30 | 2
A123 | 2019-07-31 | 2
A123 | 2019-08-31 | 2
A123 | 2019-09-30 | 2
A123 | 2019-10-31 | 2
A123 | 2019-11-30 | 2
A123 | 2019-12-31 | 2
A123 | 2020-01-31 | 2
A123 | 2020-02-29 | 2
A123 | 2020-03-31 | 2
A123 | 2020-04-30 | 3
A123 | 2020-05-31 | 3
A123 | 2020-06-30 | 3
A123 | 2020-07-31 | 3
A123 | 2020-08-31 | 3
A123 | 2020-09-30 | 3
A123 | 2020-10-31 | 3
A123 | 2020-11-30 | 3
A123 | 2020-12-31 | 3
A123 | 2021-01-31 | 3
A123 | 2021-02-28 | 3
A123 | 2021-03-31 | 3
A123 | 2021-04-30 | 3
A123 | 2021-05-31 | 3
A123 | 2021-06-30 | 3
.....
Note: If an employee joins on the last day of the month or leaves on the first day of the month, it still counts as if the employee was there the whole month.
And dataset4
that summarises data from 2018-01-31 to 2021-06-30:
Organisation | Average Nr_employees | Nr_employees joined | Nr_employess left | Nr_employess stayed the whole time
A123 | 115/42 = 2.7 | 2 | 1 | 1
....
Any ideas on how to generate dataset3
and dataset4
?
I prefer to work with the data.table
package - for problems like creating dataset3
, the non-equijoin functionality is a great fit.
library(data.table)
setDT(dataset1)
dataset2 <- CJ(Organisation = dataset1[,unique(Organisation)],
## This is an option to generate the month sequence based on the first date in dataset1 to present
# Month = seq.Date(from = as.Date(cut.Date(dataset1[,min(Joint_date)], breaks = "months")),
# to = as.Date(cut.Date(Sys.Date(), breaks = "months")),
# by = "month") - 1
## Otherwise you can still generate a full sequence of month-end dates with just a start and end
Month = seq.Date(from = as.Date("2018-02-01"),
to = as.Date("2021-07-01"),
by = "month") - 1)
## Simpler to compare month start dates than end
dataset2[,MonthStart := as.Date(cut.Date(Month, breaks = "months"))]
## Fill NA's for Left_date with today's date to properly account for employees still present
dataset1[,Left_date_fill := data.table::fcoalesce(Left_date, Sys.Date())]
## Create columnns with the month start dates of arrivals/departures
dataset1[,Joint_date_month := as.Date(cut.Date(Joint_date, breaks = "months"))]
dataset1[,Left_date_fill_month := as.Date(cut.Date(Left_date_fill, breaks = "months"))]
## Use a non-equijoin to summarize the number of employees present by month
dataset2[dataset1, Nr_employees := .N, by = .(Organisation,
Month), on = .(Organisation = Organisation,
MonthStart >= Joint_date_month,
MonthStart <= Left_date_fill_month)]
## Using this method, the information required for `dataset3` has been added to `dataset2` instead
print(dataset2[seq_len(5), .(Organisation, Month, Nr_employees)])
# Organisation Month Nr_employees
# 1: A123 2018-01-31 2
# 2: A123 2018-02-28 3
# 3: A123 2018-03-31 3
# 4: A123 2018-04-30 3
# 5: A123 2018-05-31 3
# 6: A123 2018-06-30 3
To create a summary table like dataset4
, it makes the most sense to me to break up each of the steps into a separate operation:
## Start with a table of organizations for dataset4
dataset4 <- data.table(Organisation = dataset1[,unique(Organisation)])
## Join on a summary of dataset2 to get the average over the window of interest
dataset4[dataset2[,.(Avg = mean(fcoalesce(Nr_employees),0.0)), by = .(Organisation)]
,Average_Nr_employees := Avg, on = .(Organisation)]
## Join a summary of dataset1 counting the number that joined in the window of interest
dataset4[dataset1[Joint_date_month >= dataset2[,min(MonthStart)]
& Joint_date_month <= dataset2[,max(MonthStart)]
, .(N = .N)
, by = .(Organisation)], Nr_employees_joined := N, on = .(Organisation)]
## Join a summary of dataset1 counting the number that left in the window of interest
dataset4[dataset1[Left_date_fill_month >= dataset2[,min(MonthStart)]
& Left_date_fill_month <= dataset2[,max(MonthStart)]
, .(N = .N)
, by = .(Organisation)], Nr_employees_left := N, on = .(Organisation)]
## Join a summary of dataset1 counting the number that joined before and left after window of interest
dataset4[dataset1[Joint_date_month <= dataset2[,min(MonthStart)]
& Left_date_fill_month >= dataset2[,max(MonthStart)]
, .(N = .N)
, by = .(Organisation)], Nr_employees_stayed := N, on = .(Organisation)]
print(dataset4)
# Organisation Average_Nr_employees Nr_employees_joined Nr_employees_left Nr_employees_stayed
# 1: A123 2.738095 2 1 1
# 2: A333 3.761905 1 2 3
# 3: A445 2.071429 NA 1 2