I have a data frame in R containing a series of dates. The earliest date is (ISO format) 2015-03-22 and the latest date is 2016-01-03, but there are two breaks within the data. Here is what it looks like:
library(tidyverse)
library(lubridate)
date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
                                  ymd("2015-07-03"),
                                  by = "days"),
                              seq(ymd("2015-08-09"),
                                  ymd("2015-10-01"),
                                  by = "days"),
                              seq(ymd("2015-11-12"),
                                  ymd("2016-01-03"),
                                  by = "days")),
                    sample_id = 0L)
I.e.:
> date_data
# A tibble: 211 x 2
dates sample_id
<date> <int>
1 2015-03-22 0
2 2015-03-23 0
3 2015-03-24 0
4 2015-03-25 0
5 2015-03-26 0
6 2015-03-27 0
7 2015-03-28 0
8 2015-03-29 0
9 2015-03-30 0
10 2015-03-31 0
# … with 201 more rows
What I want to do is take ten 10-day-long samples of continuous dates from within that time series, without replacement. For example, a valid sample would be the ten days from 2015-04-01 to 2015-04-10, because that range falls completely within the dates column of my date_data data frame. Each sample would then get a unique (non-zero) number in the sample_id column of date_data, such as 1:10.
To be clear, my requirements are:
Each sample would be 10 consecutive days.
The sampling has to be without replacement. So if sample_id == 1 is the 2015-04-01 to 2015-04-10 period, those dates can't be part of another 10-day-long sample.
Each 10-day-long sample can't include any date that's not within date_data$dates.
At the end, date_data$sample_id would have unique numbers representing each 10-day-long sample, likely with lots of 0s left over that were not part of any sample (and there would be 100 rows, 10 for each sample, where sample_id != 0).
I am aware of dplyr::sample_n(), but it doesn't sample consecutive values, and I don't know how to devise a way to "remember" which dates have already been sampled.
What's a good way to do this? A for loop? Or perhaps something with purrr? Thank you very much for your help.
UPDATE: @gfgm's solution reminded me that performance is an important consideration. My real dataset is quite a bit larger, and in some cases I would want to take 20+ samples instead of just 10. Ideally the length of each sample could be changed as well, i.e. not necessarily 10 days long.
This is tricky, as you anticipated, because of the requirement of sampling without replacement. I have a working solution below that produces a random sample and runs quickly on a problem of the scale given in your toy example. It should also be fine with more observations, but it will get very slow if the samples have to cover a large share of the rows.
The basic premise is to pick n = 10 starting points, generate the 10 index vectors running forward from those points, and, if any of the vectors overlap, discard them all and pick again. This is simple and works fine as long as 10*n << nrow(df). If you wanted to get 15 subvectors out of your roughly 200 observations, this would be a good deal slower.
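To put a rough number on "a good deal slower": assuming the starting points are drawn uniformly without replacement, as in the function below, the chance that a single draw has no overlaps follows from a standard counting argument. This is only a sketch, and accept_prob is a name made up for this illustration rather than part of the solution.

```r
# Hypothetical helper (not part of the answer's code): probability that one
# draw of n start points yields n non-overlapping windows of length `out`.
accept_prob <- function(n_rows, n = 10, out = 10) {
  n_starts <- n_rows - (out - 1)            # number of possible start rows
  free <- n_starts - (n - 1) * (out - 1)    # slots left after forcing the gaps
  if (free < n) return(0)                   # the windows cannot all fit
  choose(free, n) / choose(n_starts, n)
}

accept_prob(211, n = 10)  # ten samples from the 211-row example
accept_prob(211, n = 15)  # fifteen samples: several orders of magnitude smaller
```

The expected number of draws before one is accepted is 1 / accept_prob(...), so the cost grows quickly as n * out approaches nrow(df).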
library(tidyverse)
library(lubridate)
date_data <- tibble(dates = c(seq(ymd("2015-03-22"),
                                  ymd("2015-07-03"),
                                  by = "days"),
                              seq(ymd("2015-08-09"),
                                  ymd("2015-10-01"),
                                  by = "days"),
                              seq(ymd("2015-11-12"),
                                  ymd("2016-01-03"),
                                  by = "days")),
                    sample_id = 0L)
# A function that picks n starting indices, projects each one forward
# `out` rows, and resamples if any of the segments overlap
pick_n_vec <- function(df, n = 10, out = 10) {
  # the latest allowed start is nrow(df) - (out - 1), so every window fits
  points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
  vecs <- lapply(points, function(i) i:(i + (out - 1)))
  # an index that appears more than once means two windows overlap
  while (max(table(unlist(vecs))) > 1) {
    points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
    vecs <- lapply(points, function(i) i:(i + (out - 1)))
  }
  vecs
}
# demonstrate
set.seed(42)
indices <- pick_n_vec(date_data)
for (i in seq_along(indices)) {
  date_data$sample_id[indices[[i]]] <- i
}
date_data[indices[[1]], ]
#> # A tibble: 10 x 2
#> dates sample_id
#> <date> <int>
#> 1 2015-05-31 1
#> 2 2015-06-01 1
#> 3 2015-06-02 1
#> 4 2015-06-03 1
#> 5 2015-06-04 1
#> 6 2015-06-05 1
#> 7 2015-06-06 1
#> 8 2015-06-07 1
#> 9 2015-06-08 1
#> 10 2015-06-09 1
table(date_data$sample_id)
#>
#> 0 1 2 3 4 5 6 7 8 9 10
#> 111 10 10 10 10 10 10 10 10 10 10
Created on 2019-01-16 by the reprex package (v0.2.1)
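One caveat worth checking: the function works on row positions, so a window that starts a few rows before one of the breaks in the data would run across it (for example, rows 100 to 109 include both 2015-07-03 and 2015-08-09). A small check like the one below can flag that. It is only a sketch; check_samples is a name invented here, and it assumes the dates and sample_id columns from the question.

```r
# Hypothetical checker: for each non-zero sample_id, TRUE if the sample has
# exactly `out` rows and its dates are consecutive calendar days.
check_samples <- function(df, out = 10) {
  ids <- sort(setdiff(unique(df$sample_id), 0L))
  ok <- vapply(ids, function(id) {
    d <- sort(df$dates[df$sample_id == id])
    length(d) == out && all(diff(d) == 1)
  }, logical(1))
  names(ok) <- ids
  ok
}

# Small made-up example with a break after 2015-07-03
demo <- data.frame(
  dates = c(seq(as.Date("2015-06-20"), as.Date("2015-07-03"), by = "days"),
            seq(as.Date("2015-08-09"), as.Date("2015-08-22"), by = "days")),
  sample_id = 0L
)
demo$sample_id[1:10] <- 1L    # stays inside the first run of dates
demo$sample_id[11:20] <- 2L   # rows 11 to 20 straddle the break
check_samples(demo)
```

Applied to date_data after the assignment loop above, any FALSE entry marks a sample that crossed a break.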
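# A leaner variant of the same idea: the windows are disjoint exactly when
# the sorted starting points are at least `out` rows apart, so check that
# directly instead of building and tabulating the index vectors
pick_n_vec2 <- function(df, n = 10, out = 10) {
  points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
  while (min(diff(sort(points))) < out) {
    points <- sample(nrow(df) - (out - 1), n, replace = FALSE)
  }
  lapply(points, function(i) i:(i + (out - 1)))
}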
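Given the update about performance, a different approach avoids redrawing everything: list the start rows whose window covers exactly `out` consecutive days, pick one at random, remove the starts that would overlap it, and repeat. The following is a rough sketch, not a drop-in replacement. pick_n_runs is an invented name, it assumes dates is sorted with no duplicates, and picking sequentially may not give every valid arrangement the same probability the way the redraw approach does.

```r
# Hypothetical sketch: choose n non-overlapping windows of `out` consecutive
# days without redrawing everything. Assumes `dates` is sorted and unique.
pick_n_runs <- function(dates, n = 10, out = 10) {
  starts <- seq_len(max(length(dates) - out + 1, 0))
  # a start is valid if its window ends exactly out - 1 days later,
  # i.e. the window does not cross a break in the dates
  span <- as.integer(dates[starts + out - 1] - dates[starts])
  valid <- starts[span == out - 1]
  chosen <- integer(0)
  while (length(chosen) < n) {
    if (length(valid) == 0) stop("ran out of room; retry or lower n")
    s <- valid[sample.int(length(valid), 1)]  # safe even if one start is left
    chosen <- c(chosen, s)
    valid <- valid[abs(valid - s) >= out]     # drop starts that would overlap
  }
  lapply(chosen, function(i) i:(i + out - 1))
}

dates <- c(seq(as.Date("2015-03-22"), as.Date("2015-07-03"), by = "days"),
           seq(as.Date("2015-08-09"), as.Date("2015-10-01"), by = "days"),
           seq(as.Date("2015-11-12"), as.Date("2016-01-03"), by = "days"))
set.seed(1)
indices <- pick_n_runs(dates, n = 10, out = 10)
```

The result has the same shape as the output of pick_n_vec, so the assignment loop above works unchanged. Each pick only filters the remaining valid starts, so the cost grows with n rather than with the number of rejected draws.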