r, duplicates, dataset

Add count of duplicate observations to data frame


I am trying to get all duplicated observations. I have searched, but all the solutions I found seem to work on individual columns. Is it possible to get the entire rows?

My dataset looks like this:

structure(list(CrimeId = c(160903280L, 160912272L, 160912590L, 
160912801L, 160912811L, 160913003L), OriginalCrimeTypeName = c("Assault / Battery", 
"Homeless Complaint", "Susp Info", "Report", "594", "Ref'd"), 
    OffenseDate = c("2016-03-30T00:00:00", "2016-03-31T00:00:00", 
    "2016-03-31T00:00:00", "2016-03-31T00:00:00", "2016-03-31T00:00:00", 
    "2016-03-31T00:00:00"), CallTime = c("18:42", "15:31", "16:49", 
    "17:38", "17:42", "18:29"), CallDateTime = c("2016-03-30T18:42:00", 
    "2016-03-31T15:31:00", "2016-03-31T16:49:00", "2016-03-31T17:38:00", 
    "2016-03-31T17:42:00", "2016-03-31T18:29:00"), Disposition = c("REP", 
    "GOA", "GOA", "GOA", "REP", "GOA"), Address = c("100 Block Of Chilton Av", 
    "2300 Block Of Market St", "2300 Block Of Market St", "500 Block Of 7th St", 
    "Beale St/bryant St", "16th St/pond St"), City = c("San Francisco", 
    "San Francisco", "San Francisco", "San Francisco", "San Francisco", 
    "San Francisco"), State = c("CA", "CA", "CA", "CA", "CA", 
    "CA"), AgencyId = c("1", "1", "1", "1", "1", "1"), Range = c(NA, 
    NA, NA, NA, NA, NA), AddressType = c("Premise Address", "Premise Address", 
    "Premise Address", "Premise Address", "Intersection", "Intersection"
    )), row.names = c(NA, 6L), class = "data.frame")

Solution

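  • With dplyr, try group_by_all() or its now-recommended equivalent, group_by(across(everything())). Grouping by every column puts identical rows in the same group, so n() returns how many times each complete row occurs. The example below uses a slightly extended data set in which I created a duplicated entry (rows 2 and 5).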

    library(dplyr)
    
    df %>% 
      group_by(across(everything())) %>% 
      mutate(dup = n())
    ...AgencyId Range AddressType       dup
    ...  <chr>    <lgl> <chr>           <int>
    ...1 1        NA    Premise Address     1
    ...2 1        NA    Premise Address     2
    ...3 1        NA    Premise Address     1
    ...4 1        NA    Premise Address     1
    ...5 1        NA    Premise Address     2
    ...6 1        NA    Intersection        1
    ...7 1        NA    Intersection        1
    

    (only showing the last 4 columns)

    Extended data:

    df <- structure(list(CrimeId = c(160903280L, 160912272L, 160912590L,
    160912801L, 160912272L, 160912811L, 160913003L), OriginalCrimeTypeName = c("Assault / Battery",
    "Homeless Complaint", "Susp Info", "Report", "Homeless Complaint",
    "594", "Ref'd"), OffenseDate = c("2016-03-30T00:00:00", "2016-03-31T00:00:00",
    "2016-03-31T00:00:00", "2016-03-31T00:00:00", "2016-03-31T00:00:00",
    "2016-03-31T00:00:00", "2016-03-31T00:00:00"), CallTime = c("18:42",
    "15:31", "16:49", "17:38", "15:31", "17:42", "18:29"), CallDateTime = c("2016-03-30T18:42:00",
    "2016-03-31T15:31:00", "2016-03-31T16:49:00", "2016-03-31T17:38:00",
    "2016-03-31T15:31:00", "2016-03-31T17:42:00", "2016-03-31T18:29:00"
    ), Disposition = c("REP", "GOA", "GOA", "GOA", "GOA", "REP",
    "GOA"), Address = c("100 Block Of Chilton Av", "2300 Block Of Market St",
    "2300 Block Of Market St", "500 Block Of 7th St", "2300 Block Of Market St",
    "Beale St/bryant St", "16th St/pond St"), City = c("San Francisco",
    "San Francisco", "San Francisco", "San Francisco", "San Francisco",
    "San Francisco", "San Francisco"), State = c("CA", "CA", "CA",
    "CA", "CA", "CA", "CA"), AgencyId = c("1", "1", "1", "1", "1",
    "1", "1"), Range = c(NA, NA, NA, NA, NA, NA, NA), AddressType = c("Premise Address",
    "Premise Address", "Premise Address", "Premise Address", "Premise Address",
    "Intersection", "Intersection")), row.names = c("1", "2", "3",
    "4", "21", "5", "6"), class = "data.frame")