Suppose I have the following data
workID <- c("A1", "A1", "B1", "C1", "C1", "C1", "D1", "A1")
Employee <- c(12, 22, 31, 90, 108, 17, 23, 56)
FY <- c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021)
Office <- c("HQ", "HQ", "Tulsa", "Dallas", "Dallas", "Dallas", "Cleveland", "HQ")
Hours <- c(100, 200, 100, 150, 300, 275, 600, 700)
data <- data.frame(workID, Employee, FY, Office, Hours)
So that data
now contains
workID Employee FY Office Hours
1 A1 12 2019 HQ 100
2 A1 22 2019 HQ 200
3 B1 31 2019 Tulsa 100
4 C1 90 2020 Dallas 150
5 C1 108 2020 Dallas 300
6 C1 17 2020 Dallas 275
7 D1 23 2021 Cleveland 600
8 A1 56 2021 HQ 700
In my actual there are a lot more columns and 2.6M rows...
Before I go into creating summary stats, I would first like to identify cases where the same WorkID has been used in multiple years, as that shouldn't happen. It's okay for it to appear multiple times in the same year (each employee on a workID has their own row) but the same workID should not be used again in a different year. So in the above data, A1 should be identified as having been used in 2019 and 2021.
Where I'm headed is that I want to collapse things down to one row per workID with new columns for the number of employees that worked on that workID and the total hours spent. My actual code (names don't match above, but you get the idea) looks like
inspections <- consolidated %>% filter(Operation.Code.Desc %in% inspection)
num_inspections <- length(unique(inspections$Work.Accomplishment.ID))
#79,521
inspections2 <- consolidated %>% select(Operation.Date.FY,
Work.Accomplishment.ID,
Program.Area.Abbrv,
Operation.Code.Desc,
Total.Hours.Spent) %>%
filter(Operation.Code.Desc %in% inspection) %>%
group_by(Operation.Date.FY, Program.Area.Abbrv, Operation.Code.Desc) %>%
summarise(personnel=n(), Total.Time = sum(Total.Hours.Spent),
ops = n_distinct(Work.Accomplishment.ID))
sum(inspections2$ops)
#79,805 -- why not matching 79,521
As you can see there's a discrepancy that I can't work out, and I'm speculating there might be some re-use within the large dataset...grasping at straws here. My total hours match up between the two dataframes above, but not the number of inspections.
Sorry I can't post actual data here...it's huge and proprietary.
Simply
library(dplyr)
workID <- c("A1", "A1", "B1", "C1", "C1", "C1", "D1", "A1")
Employee <- c(12, 22, 31, 90, 108, 17, 23, 56)
FY <- c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021)
Office <- c("HQ", "HQ", "Tulsa", "Dallas", "Dallas", "Dallas", "Cleveland", "HQ")
Hours <- c(100, 200, 100, 150, 300, 275, 600, 700)
data <- data.frame(workID, Employee, FY, Office, Hours)
data %>%
filter(n_distinct(FY) > 1, .by = workID)
#> workID Employee FY Office Hours
#> 1 A1 12 2019 HQ 100
#> 2 A1 22 2019 HQ 200
#> 3 A1 56 2021 HQ 700
Created on 2024-04-05 with reprex v2.1.0