Search code examples
rdata-cleaning

How To Remove Duplicate Rows By Column's Value in R?


I have the data frame below


    Email          | Action   | Time 
    -------------- | -------- | -------------------
    [email protected] | checkin  | 2016-07-11 08:50:59
    [email protected] | checkin  | 2016-07-11 08:51:42
    [email protected] | checkin  | 2016-07-11 08:55:13 -> this is duplicate data
    [email protected] | checkout | 2016-07-11 19:15:20
    [email protected] | checkin  | 2016-07-12 08:58:20
    [email protected] | checkin  | 2016-07-12 09:15:00
    [email protected] | checkin  | 2016-07-12 09:20:13 -> this is duplicate data
    [email protected] | checkout | 2016-07-12 20:20:10

I need to remove the duplicate rows with the same action value (for this case are checkin and checkout) within the same date (YYYY-MM-DD). It should be like this:


    Email          | Action   | Time 
    -------------- | -------- | -------------------
    [email protected] | checkin  | 2016-07-11 08:50:59
    [email protected] | checkin  | 2016-07-11 08:51:42
    [email protected] | checkout | 2016-07-11 19:15:20
    [email protected] | checkin  | 2016-07-12 08:58:20
    [email protected] | checkin  | 2016-07-12 09:15:00
    [email protected] | checkout | 2016-07-12 20:20:10

I know I can substr the Time column to get the YYYY-MM-DD, but I don't have any idea how to remove the duplicate rows. Thanks in advance for your help :)


Solution

  • First create a column called Days to keep dates in the format YYYY-d-m. Then use the combination of Email, Action and Days to find the duplicates and remove them.

    ##Create the column
    flights_info$Days <- as.POSIXct(strftime(flights_info$Time, format = "%Y-%d-%m %H:%M:%S"), format="%Y-%d-%m")
    
    ##Remove the duplicates
    flights_info[!duplicated(flights_info[,c("Email","Action","Days")]),]
    

    Desired output:

    Email          Action   Time                Days      
     [email protected] checkin  2016-07-11 08:50:59 2016-07-11
     [email protected] checkin  2016-07-11 08:51:42 2016-07-11
     [email protected] checkout 2016-07-11 19:15:20 2016-07-11
     [email protected] checkin  2016-07-12 08:58:20 2016-07-12
     [email protected] checkin  2016-07-12 09:15:00 2016-07-12
     [email protected] checkout 2016-07-12 20:20:10 2016-07-12
    

    If you wish not to have the Days column anymore, you can remove it with:

    flights_info[!duplicated(flights_info[,c("Email","Action","Days")]),-ncol(flights_info)]
    

    I hope this helps.