I have a large data frame which can be grouped by different customer IDs (ID). Each ID has several visiting dates (VisitingTime). If there is a gap of 45 five days between visits within an ID, I want to define it as new project. Then I need to find the start and end date for each project, within each ID. Below is my code to find the start and end dates, but what is the professional form of this code in R?
(x is a customer records example) For instance having a customer as follows:
x:
ID VisitingTime
2 Customer_001 2011-09-01
3 Customer_001 2011-09-22
4 Customer_001 2011-10-25
5 Customer_001 2011-11-29
6 Customer_001 2011-12-20
7 Customer_001 2012-01-13
8 Customer_001 2012-02-03
9 Customer_001 2012-02-24
10 Customer_001 2013-07-24
11 Customer_001 2013-08-08
12 Customer_001 2013-08-29
13 Customer_001 2013-09-12
14 Customer_001 2013-10-03
15 Customer_001 2013-10-24
I need to have:
> start
[1] "2011-09-01" "2013-07-24"
> end
[1] "2012-02-24" "2013-10-24"
My code:
start <- x[1,2]
end <- x[nrow(x),2]
for (i in 1:(nrow(x)-1)){
if (difftime(x[i+1,2], x[i,2] , units = "days") > 45){
end <- c(x[i,2],end)
start <- c(start ,x[i+1,2])
}
}
dput(x)
structure(list(ID = c("Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001"), VisitingTime = structure(c(1314835200,
1316649600, 1319500800, 1322524800, 1324339200, 1326412800, 1328227200,
1330041600, 1374624000, 1375920000, 1377734400, 1378944000, 1380758400,
1382572800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("ID",
"VisitingTime"), row.names = 2:15, class = "data.frame")
I'd use the following dplyr
one-liner:
> require(dplyr)
> x %.% group_by(ID) %.%
mutate(visit=cumsum(c(Inf,diff(VisitingTime))>45)) %.%
group_by(ID, visit) %.% summarise(end=max(VisitingTime),start=min(VisitingTime))
which produces a data frame:
ID visit end start
1 Customer_001 1 2012-02-24 2011-09-01
2 Customer_001 2 2013-10-24 2013-07-24
3 Customer_002 1 2012-02-24 2011-09-01
4 Customer_002 2 2013-10-24 2013-07-24
note I've tested it on a data frame with more than one customer ID to make sure the first part works.
How does it work? Well, start with your data and do the following, printing out x
after each step:
x$DT = c(Inf, diff(x$VisitingTime))
x$begin = x$DT>45
x$visit = cumsum(x$begin)
and you should see that x$visit
groups each visit.
The one-liner just does all this using dplyr
, and then goes on to get the min and max data within each visit group.
As a further test I've just checked what happens if I test the difference as 1 day, in which case I get one visit for each record, and also if the differences is over 9000 days, in which case I get just one visit record. (I've also just fixed a silly mistake where I called start
the max
date and vice versa)