I have the following data frame:
ddd <- data.frame(minutes = 1:15, positive = c(0,1,0,1,1,0,1,0,0,0,1,1,1,0,1))
Using sampling, I would like to estimate the probability that, in k trials of sampling consecutive intervals of ddd$minutes of length j, at least one ddd$positive appears. For example, for j=2 (2-minute intervals) the sample space consists of the intervals 1:2, 2:3, 3:4, 4:5, 5:6, 6:7, …, 14:15 of ddd$minutes. However, if the interval ddd$minutes[1:2] gets sampled in the first of the k trials (one success), then the interval ddd$minutes[2:3] is removed from the sample space before the next random draw, because the two intervals intersect (ddd$minutes[2] belongs to both).
This is not simply a matter of sampling without replacement: not only the sampled interval but also every interval that intersects with an already sampled one must be removed from the sample space before the next sampling takes place.
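To make the difference from plain sampling without replacement concrete, here is a minimal sketch. It assumes each interval is identified by its starting minute, so that two intervals of length j intersect exactly when their starts differ by less than j. `has_overlap` is a hypothetical helper name used only for illustration.

```r
# Assumption for this sketch: an interval of length j is identified by its
# starting minute; starts s and t intersect when abs(s - t) < j.
has_overlap <- function(starts, j) {
  if (length(starts) < 2) return(FALSE)
  any(diff(sort(starts)) < j)
}

n <- 15
j <- 2
all_starts <- seq_len(n - j + 1)   # intervals 1:2, 2:3, ..., 14:15

has_overlap(c(1, 2), j)   # TRUE: 1:2 and 2:3 share minute 2
has_overlap(c(1, 3), j)   # FALSE: 1:2 and 3:4 are disjoint

# Plain sampling without replacement does not rule out such overlaps,
# so the result here depends on the draw:
set.seed(1)
picks <- sample(all_starts, 3)
has_overlap(picks, j)
```

Sorting the starts first means only neighbouring picks need to be compared, which keeps the check cheap even for larger k.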
EDIT (comment from Tim P): length(ddd$minutes) can be somewhere between 1000 and 1200, k between 1 and 16, and j between 1 and 30.
EDIT2 (comment by Thierry)
I am giving an example, following a comment and answer by Thierry
ddd<-data.frame(minutes=1:15,positive=c(0,1,0,1,1,0,1,0,0,0,1,1,1,0,1))
j=3; k=3
Sample space S0 (before the first sampling): S0: {1:3, 2:4, 3:5, 4:6, 5:7, 6:8, 7:9, 8:10, 9:11, 10:12, 11:13, 12:14, 13:15}. The length of S0 is 13 (n-j+1).
First trial out of k: the element 8:10 gets selected.
S1 is then defined as S0 without the elements 6:8, 7:9, 8:10, 9:11, 10:12, which intersect with the sampled element 8:10.
So, S1 is:{ 1:3, 2:4, 3:5, 4:6, 5:7, 11:13, 12:14, 13:15}
Second trial out of k: element 4:6 gets selected
S2 is defined as S1 without the elements 2:4, 3:5, 4:6, 5:7, which intersect with the sampled element 4:6.
So, S2:{1:3, 11:13, 12:14, 13:15}
and so on until the *k*th sample. Eventually, my goal is to run this kind of sampling many times and estimate the probability that at least one positive (ddd$positive equal to 1) gets picked up.
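The worked example above can be checked with a short sketch. It again assumes that intervals are represented by their starting minute, and `drop_intersecting` is a hypothetical helper introduced only for this illustration.

```r
# Hypothetical helper: drop every start whose interval intersects the
# interval starting at `picked` (all intervals have length j).
drop_intersecting <- function(starts, picked, j) {
  starts[abs(starts - picked) >= j]
}

n <- 15
j <- 3
S0 <- seq_len(n - j + 1)            # starts 1..13, i.e. 1:3, 2:4, ..., 13:15
S1 <- drop_intersecting(S0, 8, j)   # first trial picked 8:10
S2 <- drop_intersecting(S1, 4, j)   # second trial picked 4:6

S1   # starts 1 2 3 4 5 11 12 13
S2   # starts 1 11 12 13
```

The remaining starts correspond to S1 = {1:3, 2:4, 3:5, 4:6, 5:7, 11:13, 12:14, 13:15} and S2 = {1:3, 11:13, 12:14, 13:15} as listed in the example.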
You could use a recursive function.
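# Simulated data: n minutes, each positive with probability 0.1
n <- 1000
j <- 10    # interval length
set.seed(12345)
ddd <- data.frame(minutes = seq_len(n), positive = rbinom(n, 1, 0.1))
dataset <- ddd
k <- 16    # number of trials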
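# Returns a 0/1 vector of length k: one entry per sampled interval,
# 1 if the interval contains at least one positive.
sillySampling <- function(dataset, k, j){
  # pick a random start row among the rows that leave room for j rows
  i <- sample(nrow(dataset) - j + 1, 1)
  thisSample <- max(dataset$positive[i - 1 + seq_len(j)])
  if(k > 1){
    # drop rows i-j .. i+j of the current data set, then recurse on the rest
    toRemove <- i + -j:j
    toRemove <- toRemove[toRemove >= 1 & toRemove <= nrow(dataset)]
    thisSample <- c(thisSample, sillySampling(dataset[-toRemove, ], k = k - 1, j = j))
  }
  return(thisSample)
}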
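# For each k from 1 to 16: the share of the k sampled intervals that contain
# a positive, averaged over 100 replications
rowMeans(replicate(100, {
  sapply(1:16, function(k){
    sum(sillySampling(ddd, k, j)) / k
  })
}))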
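As a possible alternative, here is a minimal sketch that keeps a vector of still-available start positions instead of subsetting the data frame, so every sampled interval consists of consecutive minutes of the original data. It reports whether at least one positive was hit across the k trials, which is one reading of the question. `sample_disjoint` is a hypothetical name, and stopping early when the sample space runs out is an assumption of this sketch.

```r
# Draw up to k mutually non-intersecting intervals of length j and report
# whether at least one of them contains a positive minute.
sample_disjoint <- function(positive, k, j) {
  starts <- seq_len(max(length(positive) - j + 1, 0))
  hit <- FALSE
  for (trial in seq_len(k)) {
    if (length(starts) == 0) break            # sample space exhausted
    # index into `starts` to avoid sample()'s special case for length-1 input
    s <- starts[sample.int(length(starts), 1)]
    hit <- hit || any(positive[s:(s + j - 1)] == 1)
    starts <- starts[abs(starts - s) >= j]    # remove intersecting intervals
  }
  hit
}

ddd <- data.frame(minutes = 1:15,
                  positive = c(0,1,0,1,1,0,1,0,0,0,1,1,1,0,1))
set.seed(42)
# Estimated probability of at least one positive in k = 3 trials with j = 3
mean(replicate(2000, sample_disjoint(ddd$positive, k = 3, j = 3)))
```

Because only an integer vector of starts is filtered on each trial, this should stay fast for the sizes mentioned in the question (1000 to 1200 minutes, k up to 16, j up to 30).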