Search code examples
rdatetime-seriesgroupingposixlt

Selecting and grouping similar dates from vectors of dates


I have three vectors of dates in POSIX format that correspond with data collection times from three large datasets. Each of these vectors is of a different length and have similar (but not identical) dates.

I would like to:

  1. group these dates into specified time ranges, e.g. group dates from each vector that fall in a 30-day window and
  2. reduce the number of date groupings to reflect the dataset with the smallest number of collection times, e.g. if "dataset A" has three sampling dates and "dataset B" has five sampling dates then there would only be three groupings of dates (unless the two extra dates in "dataset B" fall within 30 days of the dates in "dataset A").

An example with three vectors of dates in POSIX format (I want to group similar dates between the vectors, allowing a time window of 30 days):

A.dates = as.POSIXlt(c("1998-07-24 08:00","1999-07-24 08:00","2000-07-24 08:00"), 
                     tz = "America/Los_Angeles")
B.dates = as.POSIXlt(c("1998-07-25 08:00","1999-07-25 08:00","2000-07-25 08:00"), 
                     tz = "America/Los_Angeles")
C.dates = as.POSIXlt(c("1998-07-26 08:00","1999-07-26 08:00","2000-07-26 08:00","2000-08-29"), 
                     tz = "America/Los_Angeles") 

Specifying a time window of 30 days, there would be three date groupings (the sampling dates from July 1998, 1999, and 2000). The C.dates vector has a fourth collection date of August 29th, 2000 which would be excluded from the groupings because:

  1. it is not within 30 days of the July dates in the other vectors and
  2. there are no dates in the other two vectors that fall within 30 days of August 29th, 2000.

Solution

  • You could loop over each element of each vector and create sequences ± 15 days

    L <- list(A.dates, B.dates, C.dates)
    tmp <- lapply(L, function(x) lapply(x, function(x) 
      do.call(seq, c(as.list(as.Date(x) + c(-15, 15)), "day"))))
    

    and unionite them in the list.

    tmp <- lapply(tmp, function(x) as.Date(Reduce(union, x), origin="1970-01-01"))
    

    Then simply find the intersect

    i <- Reduce(function(...) as.Date(intersect(...), origin="1970-01-01"), tmp)
    

    and select the dates accordingly.

    tmp <- lapply(L, function(x) x[as.Date(x) %in% i])
    tmp
    # [[1]]
    # [1] "1998-07-24 08:00:00 PDT" "1999-07-24 08:00:00 PDT"
    # [3] "2000-07-24 08:00:00 PDT"
    # 
    # [[2]]
    # [1] "1998-07-25 08:00:00 PDT" "1999-07-25 08:00:00 PDT"
    # [3] "2000-07-25 08:00:00 PDT"
    # 
    # [[3]]
    # [1] "1998-07-26 PDT" "1999-07-26 PDT" "2000-07-26 PDT"
    

    To sort them by year according to your comment, we first unlist them. Unfortunately this converts the dates into numerics (i.e. seconds since January, 1 1970), so we need to convert them back.

    tmp <- as.POSIXlt(unlist(lapply(tmp, as.POSIXct)), origin="1970-01-01",
                    tz="America/Los_Angeles")
    

    Finally we split the list by the first four substrings which is the year (we also could do split(tmp, strftime(tmp, "%Y")) though).

    res <- split(tmp, substr(tmp, 1, 4))
    res
    # $`1998`
    # [1] "1998-07-24 08:00:00 PDT" "1998-07-25 08:00:00 PDT"
    # [3] "1998-07-26 00:00:00 PDT"
    # 
    # $`1999`
    # [1] "1999-07-24 08:00:00 PDT" "1999-07-25 08:00:00 PDT"
    # [3] "1999-07-26 00:00:00 PDT"
    # 
    # $`2000`
    # [1] "2000-07-24 08:00:00 PDT" "2000-07-25 08:00:00 PDT"
    # [3] "2000-07-26 00:00:00 PDT"