Search code examples
rtraversal

Define groups of dates based on gaps, and then find start and end dates within groups


I have a large data frame which can be grouped by different customer IDs (ID). Each ID has several visiting dates (VisitingTime). If there is a gap of 45 five days between visits within an ID, I want to define it as new project. Then I need to find the start and end date for each project, within each ID. Below is my code to find the start and end dates, but what is the professional form of this code in R?

(x is a customer records example) For instance having a customer as follows:

x:
             ID VisitingTime
2  Customer_001   2011-09-01
3  Customer_001   2011-09-22
4  Customer_001   2011-10-25
5  Customer_001   2011-11-29
6  Customer_001   2011-12-20
7  Customer_001   2012-01-13
8  Customer_001   2012-02-03
9  Customer_001   2012-02-24
10 Customer_001   2013-07-24
11 Customer_001   2013-08-08
12 Customer_001   2013-08-29
13 Customer_001   2013-09-12
14 Customer_001   2013-10-03
15 Customer_001   2013-10-24

I need to have:

> start
[1] "2011-09-01" "2013-07-24"
> end
[1] "2012-02-24"  "2013-10-24"

My code:

start <- x[1,2]
end <- x[nrow(x),2]

for (i in 1:(nrow(x)-1)){
  if (difftime(x[i+1,2], x[i,2] , units = "days") >  45){
    end <- c(x[i,2],end)
    start <- c(start ,x[i+1,2])
  }  
}

dput(x)
structure(list(ID = c("Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001", "Customer_001",
"Customer_001", "Customer_001", "Customer_001"), VisitingTime = structure(c(1314835200,
1316649600, 1319500800, 1322524800, 1324339200, 1326412800, 1328227200,
1330041600, 1374624000, 1375920000, 1377734400, 1378944000, 1380758400,
1382572800), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("ID",
"VisitingTime"), row.names = 2:15, class = "data.frame")

Solution

  • I'd use the following dplyr one-liner:

    > require(dplyr)
    > x %.% group_by(ID) %.% 
      mutate(visit=cumsum(c(Inf,diff(VisitingTime))>45)) %.% 
      group_by(ID, visit) %.% summarise(end=max(VisitingTime),start=min(VisitingTime))
    

    which produces a data frame:

                ID visit        end      start
    1 Customer_001     1 2012-02-24 2011-09-01
    2 Customer_001     2 2013-10-24 2013-07-24
    3 Customer_002     1 2012-02-24 2011-09-01
    4 Customer_002     2 2013-10-24 2013-07-24
    

    note I've tested it on a data frame with more than one customer ID to make sure the first part works.

    How does it work? Well, start with your data and do the following, printing out x after each step:

    x$DT = c(Inf, diff(x$VisitingTime))
    x$begin = x$DT>45
    x$visit = cumsum(x$begin)
    

    and you should see that x$visit groups each visit.

    The one-liner just does all this using dplyr, and then goes on to get the min and max data within each visit group.

    As a further test I've just checked what happens if I test the difference as 1 day, in which case I get one visit for each record, and also if the differences is over 9000 days, in which case I get just one visit record. (I've also just fixed a silly mistake where I called start the max date and vice versa)