I have the data frame below
Email | Action | Time -------------- | -------- | ------------------- [email protected] | checkin | 2016-07-11 08:50:59 [email protected] | checkin | 2016-07-11 08:51:42 [email protected] | checkin | 2016-07-11 08:55:13 -> this is duplicate data [email protected] | checkout | 2016-07-11 19:15:20 [email protected] | checkin | 2016-07-12 08:58:20 [email protected] | checkin | 2016-07-12 09:15:00 [email protected] | checkin | 2016-07-12 09:20:13 -> this is duplicate data [email protected] | checkout | 2016-07-12 20:20:10
I need to remove the duplicate rows with the same action value (for this case are checkin
and checkout
) within the same date (YYYY-MM-DD
). It should be like this:
Email | Action | Time -------------- | -------- | ------------------- [email protected] | checkin | 2016-07-11 08:50:59 [email protected] | checkin | 2016-07-11 08:51:42 [email protected] | checkout | 2016-07-11 19:15:20 [email protected] | checkin | 2016-07-12 08:58:20 [email protected] | checkin | 2016-07-12 09:15:00 [email protected] | checkout | 2016-07-12 20:20:10
I know I can substr
the Time column to get the YYYY-MM-DD
, but I don't have any idea how to remove the duplicate rows. Thanks in advance for your help :)
First create a column called Days
to keep dates in the format YYYY-d-m
. Then use the combination of Email
, Action
and Days
to find the duplicates and remove them.
##Create the column
flights_info$Days <- as.POSIXct(strftime(flights_info$Time, format = "%Y-%d-%m %H:%M:%S"), format="%Y-%d-%m")
##Remove the duplicates
flights_info[!duplicated(flights_info[,c("Email","Action","Days")]),]
Desired output:
Email Action Time Days
[email protected] checkin 2016-07-11 08:50:59 2016-07-11
[email protected] checkin 2016-07-11 08:51:42 2016-07-11
[email protected] checkout 2016-07-11 19:15:20 2016-07-11
[email protected] checkin 2016-07-12 08:58:20 2016-07-12
[email protected] checkin 2016-07-12 09:15:00 2016-07-12
[email protected] checkout 2016-07-12 20:20:10 2016-07-12
If you wish not to have the Days
column anymore, you can remove it with:
flights_info[!duplicated(flights_info[,c("Email","Action","Days")]),-ncol(flights_info)]
I hope this helps.