Tags: r, random, time-series, sample, lubridate

R - Sample consecutive series of dates in time series without replacement?


I have a data frame in R containing a series of dates. The earliest date is (ISO format) 2015-03-22 and the latest date is 2016-01-03, but there are two breaks within the data. Here is what it looks like:

library(tidyverse)
library(lubridate)

date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
                                  ymd("2015-07-03"),
                                  by = "days"),
                              seq(ymd("2015-08-09"),
                                  ymd("2015-10-01"),
                                  by = "days"),
                              seq(ymd("2015-11-12"),
                                  ymd("2016-01-03"),
                                  by = "days")),
                    sample_id = 0L)

I.e.:

> date_data
# A tibble: 211 x 2
   dates      sample_id
   <date>         <int>
 1 2015-03-22         0
 2 2015-03-23         0
 3 2015-03-24         0
 4 2015-03-25         0
 5 2015-03-26         0
 6 2015-03-27         0
 7 2015-03-28         0
 8 2015-03-29         0
 9 2015-03-30         0
10 2015-03-31         0
# … with 201 more rows

What I want to do is take ten 10-day-long samples of continuous dates from within that time series, without replacement. For example, a valid sample would be the ten days from 2015-04-01 to 2015-04-10, because that range falls completely within the dates column of my date_data data frame. Each sample would then get a unique (non-zero) number in the sample_id column of date_data, such as 1:10.

To be clear, my requirements are:

  1. Each sample would be 10 consecutive days.

  2. The sampling has to be without replacement. So if sample_id == 1 is the 2015-04-01 to 2015-04-10 period, those dates can't be part of another 10-day-long sample.

  3. Each 10-day-long sample can't include any date that's not within date_data$dates.

At the end, date_data$sample_id would have unique numbers representing each 10-day-long sample, likely with lots of 0s left over that were not part of any sample (and there would be 100 rows, 10 for each sample, where sample_id != 0).

I am aware of dplyr::sample_n() but it doesn't sample consecutive values, and I don't know how to devise a way to "remember" which dates have already been sampled...

What's a good way to do this? A for loop?!?! Or perhaps something with purrr? Thank you very much for your help.

UPDATE: @gfgm's solution reminded me that performance is an important consideration. My real dataset is quite a bit larger, and in some cases I would want to take 20+ samples instead of just 10. Ideally the length of each sample could be changed as well, i.e. not necessarily 10 days long.


Solution

  • This is tricky, as you anticipated, because of the requirement of sampling without replacement. The solution below draws a random sample and runs quickly on a problem the size of your toy example. It should also cope with more observations, but it becomes very slow when the sampled windows cover a large share of the rows.

    The basic premise is to pick n = 10 starting points, build the 10-row window that runs forward from each one, and, if any of the windows overlap, discard them all and pick again. This is simple and works well as long as out * n << nrow(df), where out is the window length. Asking for 15 windows from your roughly 200 observations would already be a good deal slower.

    library(tidyverse)
    library(lubridate)
    
    date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
                                      ymd("2015-07-03"),
                                      by = "days"),
                                  seq(ymd("2015-08-09"),
                                      ymd("2015-10-01"),
                                      by = "days"),
                                  seq(ymd("2015-11-12"),
                                      ymd("2016-01-03"),
                                      by = "days")),
                        sample_id = 0L)
    
    # Pick n start rows, project each one forward into a window of `out` rows,
    # and redraw all of them if any two windows overlap
    pick_n_vec <- function(df, n = 10, out = 10) {
      # Start rows must leave room for a full window before the last row
      n_starts <- nrow(df) - (out - 1)
    
      repeat {
        points <- sample(n_starts, n, replace = FALSE)
        vecs <- lapply(points, function(i) i:(i + (out - 1)))
        # Accept the draw only if no row index appears in more than one window
        if (anyDuplicated(unlist(vecs)) == 0) break
      }
    
      vecs
    }
    
    # demonstrate
    set.seed(42)
    indices <- pick_n_vec(date_data)
    
    # Label the rows of each window with its sample number
    for (i in seq_along(indices)) {
      date_data$sample_id[indices[[i]]] <- i
    }
    
    date_data[indices[[1]], ]
    #> # A tibble: 10 x 2
    #>         dates sample_id
    #>        <date>     <int>
    #>  1 2015-05-31         1
    #>  2 2015-06-01         1
    #>  3 2015-06-02         1
    #>  4 2015-06-03         1
    #>  5 2015-06-04         1
    #>  6 2015-06-05         1
    #>  7 2015-06-06         1
    #>  8 2015-06-07         1
    #>  9 2015-06-08         1
    #> 10 2015-06-09         1
    table(date_data$sample_id)
    #> 
    #>   0   1   2   3   4   5   6   7   8   9  10 
    #> 111  10  10  10  10  10  10  10  10  10  10
    

    Created on 2019-01-16 by the reprex package (v0.2.1)
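
    To get a feel for how often the rejection loop has to redraw, the acceptance probability of a single draw can be worked out directly. This is a back-of-the-envelope sketch, not part of the original answer: accept_prob is a hypothetical helper, and like pick_n_vec it reasons about row indices only, so it ignores the breaks in the dates. With m = nrow(df) - (out - 1) possible start rows, the number of ways to choose n starts that are all at least out rows apart is choose(m - (n - 1) * (out - 1), n), out of choose(m, n) equally likely draws.

    ```r
    # Hypothetical helper: probability that one draw of n start rows yields
    # n non-overlapping windows of length `out` (row indices only; breaks in
    # the dates are ignored, as in pick_n_vec).
    accept_prob <- function(n_rows, n = 10, out = 10) {
      m <- n_rows - (out - 1)                    # number of possible start rows
      free <- pmax(m - (n - 1) * (out - 1), 0)   # positions left once spacing is reserved
      choose(free, n) / choose(m, n)
    }

    accept_prob(211, n = 10)       # toy example from the question
    accept_prob(211, n = 15)       # more windows: far smaller
    1 / accept_prob(211, n = 10)   # expected number of draws until one is accepted
    ```

    For the toy data this comes out well under 1% for n = 10 and several orders of magnitude smaller for n = 15, which is why the approach degrades so quickly once the windows start to fill the data.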

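    A marginally faster version, which compares the sorted start rows directly instead of tabulating every index:

    pick_n_vec2 <- function(df, n = 10, out = 10) {
      points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
      # Two windows overlap exactly when their start rows are fewer than `out` apart
      while (min(diff(sort(points))) < out) {
        points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
      }
      lapply(points, function(i) i:(i + (out - 1)))
    }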
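
    One thing to watch for: both functions above build windows from row indices, so a window that starts near the end of one block of dates can run on into the next block, which would break requirement 3 in the question. They also redraw everything whenever two windows collide. A possible alternative, sketched here under the assumption that the dates are sorted and unique, keeps a pool of valid start rows (those whose window stays inside one unbroken run of days) and removes the rows blocked by each pick. sample_runs is a hypothetical name, not part of the original answer, and the placements it produces are random but not uniformly distributed over all valid configurations, unlike the rejection approach.

    ```r
    # Hypothetical helper (not from the original answer). Assumes `dates` is a
    # sorted Date vector without duplicates.
    sample_runs <- function(dates, n = 10, out = 10) {
      days <- as.integer(dates)
      stopifnot(!is.unsorted(days, strictly = TRUE))

      # A start row is valid if the row out - 1 positions later exists and lies
      # exactly out - 1 days later, i.e. no break falls inside the window.
      starts <- seq_len(max(length(days) - out + 1L, 0L))
      candidates <- starts[days[starts + out - 1L] - days[starts] == out - 1L]

      picked <- integer(0)
      while (length(picked) < n) {
        if (length(candidates) == 0L) {
          stop("ran out of room before placing all windows; retry or lower n")
        }
        s <- candidates[sample.int(length(candidates), 1L)]
        picked <- c(picked, s)
        # Any start closer than `out` rows to s would overlap its window
        candidates <- candidates[abs(candidates - s) >= out]
      }
      lapply(picked, function(s) s:(s + out - 1L))
    }

    # Usage with the dates from the question, built with base R here
    dates <- c(seq(as.Date("2015-03-22"), as.Date("2015-07-03"), by = "day"),
               seq(as.Date("2015-08-09"), as.Date("2015-10-01"), by = "day"),
               seq(as.Date("2015-11-12"), as.Date("2016-01-03"), by = "day"))
    set.seed(42)
    indices <- sample_runs(dates, n = 10, out = 10)

    sample_id <- integer(length(dates))
    for (i in seq_along(indices)) sample_id[indices[[i]]] <- i
    table(sample_id)
    ```

    Each pick costs one pass over the remaining candidates, so the work grows with n rather than with the number of rejected draws. When the requested windows nearly fill the data, the loop can paint itself into a corner; the sketch then stops with an error rather than backtracking.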