Search code examples
rdataframeslicebind-rows

How to remove repeated rows based on last date in r?


I have a large dataframe with many columns and rows, where lot are my sampling units (the same lot can have more than one line of information). However, new field visits are frequently made and a new table is generated, with information that may be the same, changing only the sampling date, or small changes and/or even changes in the sample size. In this case, the two tables are joined and there may be duplication of the same sampling unit, in this case I need to exclude the oldest row based on the sampling date, but I need to do this per group. I've already made some attempts, but the code doesn't work very well, sometimes it leaves old lines or repeated lines.

  • Replicable example

    df1=data.frame(
    block=c(1,1,2,2,2,3,4,5,6,6,6,6),
    units=c(1,2,1,2,3,1,1,1,1,2,3,4),
    Sector=c(1,1,1,1,1,3,3,3,3,3,3,3),
    lot=c(25, 25,18 ,18 ,18 ,9,50,110,200,200,200,200),
    date= c("2019","2019","2023/10/24","2023/10/24","2023/10/24",
        "2023/09/24","2023/08/24","2023/10/24","2019", 
    "2019","2019","2019")
    )
    
    
    df2=data.frame(
    block=c(2, 2, 2, 4, 6, 6, 6,6 ),
    units=c(1, 2, 3, 1, 1, 2, 3,3 ),
    Sector=c(1,1,1,3,3,3,3,3),
    lot=c(18 ,18 ,18,50,200,200,200, 200),
    date= c("2023/12/24", "2023/12/24", "2023/12/24","2023/12/24",      
         "2023/11/24" ,   "2023/11/24",   "2023/11/24", "2023/11/24")
    )
    
    df_join<-rbind(df1, df2)
    
    library(tidyverse)
    
    df_clean<-df_join %>%
    group_by(Sector, block, lot,units ) %>%
    slice_max(date)
    

enter image description here


Solution

  • The exsiting answering still has a duplicate row in the last.

    The biggest problem here is that your date column is not regular, which is a really bad practice. Therefore, we can first try to correct it by considering 2019 as 2019/01/01 or whatever consistent as you don't have the valid data, and convert it to date or date time.

    df_join['date']%<>%apply(1,FUN= function(x){if(!x%>%str_detect('\\/')) 
      x%>%str_glue("/01/01") 
      else x})%>%
      as.Date()
    #OR
    df_join$date = as.character(df_join$date)
    df_join%<>%
      mutate(date = if_else(!date%>%str_detect('\\/'), date%>%paste0("/01/01"), date)%>%as.Date())
    ##track change, both should result in
    > df_join$date
    
     [1] "2019-01-01" "2019-01-01" "2023-10-24" "2023-10-24" "2023-10-24" "2023-09-24" "2023-08-24" "2023-10-24" "2019-01-01"
    [10] "2019-01-01" "2019-01-01" "2019-01-01" "2023-12-24" "2023-12-24" "2023-12-24" "2023-12-24" "2023-11-24" "2023-11-24"
    [19] "2023-11-24" "2023-11-24"
    

    Now, use the last chunk of the code your have with adding %>% distinct() in the end, you can get the most recent record you want.

    df_clean<-df_join %>% group_by(Sector, block, lot,units ) %>% slice_max(date) %>% distinct()

    > df_clean
    # A tibble: 12 x 5
    # Groups:   Sector, block, lot, units [12]
       block units Sector   lot date      
       <dbl> <dbl>  <dbl> <dbl> <date>    
     1     1     1      1    25 2019-01-01
     2     1     2      1    25 2019-01-01
     3     2     1      1    18 2023-12-24
     4     2     2      1    18 2023-12-24
     5     2     3      1    18 2023-12-24
     6     3     1      3     9 2023-09-24
     7     4     1      3    50 2023-12-24
     8     5     1      3   110 2023-10-24
     9     6     1      3   200 2023-11-24
    10     6     2      3   200 2023-11-24
    11     6     3      3   200 2023-11-24
    12     6     4      3   200 2019-01-01