Search code examples
rdplyr

Find duplicates across groups in R with dplyr


Suppose I have the following data

workID <- c("A1", "A1", "B1", "C1", "C1", "C1", "D1", "A1")
Employee <- c(12, 22, 31, 90, 108, 17, 23, 56)
FY <- c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021)
Office <- c("HQ", "HQ", "Tulsa", "Dallas", "Dallas", "Dallas", "Cleveland", "HQ")
Hours <- c(100, 200, 100, 150, 300, 275, 600, 700)

data <- data.frame(workID, Employee, FY, Office, Hours)

So that data now contains

workID Employee   FY    Office Hours
1     A1       12 2019        HQ   100
2     A1       22 2019        HQ   200
3     B1       31 2019     Tulsa   100
4     C1       90 2020    Dallas   150
5     C1      108 2020    Dallas   300
6     C1       17 2020    Dallas   275
7     D1       23 2021 Cleveland   600
8     A1       56 2021        HQ   700

In my actual there are a lot more columns and 2.6M rows...

Before I go into creating summary stats, I would first like to identify cases where the same WorkID has been used in multiple years, as that shouldn't happen. It's okay for it to appear multiple times in the same year (each employee on a workID has their own row) but the same workID should not be used again in a different year. So in the above data, A1 should be identified as having been used in 2019 and 2021.

Where I'm headed is that I want to collapse things down to one row per workID with new columns for the number of employees that worked on that workID and the total hours spent. My actual code (names don't match above, but you get the idea) looks like

inspections <- consolidated %>% filter(Operation.Code.Desc %in% inspection)
num_inspections <- length(unique(inspections$Work.Accomplishment.ID))
#79,521

inspections2 <- consolidated %>% select(Operation.Date.FY,
                                               Work.Accomplishment.ID, 
                                               Program.Area.Abbrv,
                                               Operation.Code.Desc, 
                                               Total.Hours.Spent) %>%
  filter(Operation.Code.Desc %in% inspection) %>%
  group_by(Operation.Date.FY, Program.Area.Abbrv, Operation.Code.Desc) %>%
  summarise(personnel=n(), Total.Time = sum(Total.Hours.Spent),
            ops = n_distinct(Work.Accomplishment.ID))
sum(inspections2$ops)
#79,805  -- why not matching 79,521

As you can see there's a discrepancy that I can't work out, and I'm speculating there might be some re-use within the large dataset...grasping at straws here. My total hours match up between the two dataframes above, but not the number of inspections.

Sorry I can't post actual data here...it's huge and proprietary.


Solution

  • Simply

    library(dplyr)
    
    workID <- c("A1", "A1", "B1", "C1", "C1", "C1", "D1", "A1")
    Employee <- c(12, 22, 31, 90, 108, 17, 23, 56)
    FY <- c(2019, 2019, 2019, 2020, 2020, 2020, 2021, 2021)
    Office <- c("HQ", "HQ", "Tulsa", "Dallas", "Dallas", "Dallas", "Cleveland", "HQ")
    Hours <- c(100, 200, 100, 150, 300, 275, 600, 700)
    
    data <- data.frame(workID, Employee, FY, Office, Hours)
    
    data %>% 
      filter(n_distinct(FY) > 1, .by = workID)
    #>   workID Employee   FY Office Hours
    #> 1     A1       12 2019     HQ   100
    #> 2     A1       22 2019     HQ   200
    #> 3     A1       56 2021     HQ   700
    

    Created on 2024-04-05 with reprex v2.1.0