I want to randomize/shuffle a vector. Some of the vector elements are identical. After shuffling, identical elements should have a minimum distance of three (i.e. two other elements should be between identical elements).
Consider the following example vector in R:
x <- rep(LETTERS[1:5], 3) # Create example vector
x
# [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"
If I shuffle my vector using the sample function, some of the identical elements may be too close together. For instance, if I use the following R code, the element "C" appears directly after each other at positions 5 and 6:
set.seed(53135)
sample(x) # sample() function puts same elements too close
# [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"
How could I ensure that identical elements have a minimum distance of three?
So basically we need to conditionally sample one element from the x vector that have not been selected in the min.dist-1
runs. Using purrr's reduce we can achieve this:
min.dist <- 2
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
[1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"
Bundled in a function
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
}
> shuffle(x, 3)
[1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
> shuffle(x, 3)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
> shuffle(x, 4)
[1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
> shuffle(x, 4)
[1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
> shuffle(x, 2)
[1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
> shuffle(x, 2)
[1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"
after @27ϕ9 comment:
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
}
> table(shuffle(rep(LETTERS[1:5], 3),2))
A B C D E
3 3 3 3 3
> table(shuffle(rep(LETTERS[1:5], 3),2))
Error in sample.int(length(x), size, replace, prob) :
invalid first argument
After some trial and error, looking at the fact that not always you're gonna have enough elements to space out the min.dist
I came up with a solution this code is the most explained from the ones above :
shuffle <- function(x, min.dist=2){
stopifnot(min.dist < length(unique(x)))
reduce(integer(length(x)-1), function(.x, ...){
# whether the value is in the tail of the aggregated vector
in.tail <- x %in% tail(.x, min.dist)
# whether a value still hasn't reached the max frequency
freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
# whether a value isn't in the aggregated vector
yet <- !x %in% .x
# the if is there basically to account for the cases when we don't have enough vars to space out the vectors
c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else x[which(freq.got)[1]] )
}, .init=sample(x,1))
}
now running the table(shuffle(rep(LETTERS[1:5], 3),2))
will always return 3 for all vars and we can say with some certainty that in the vector the variables are spaced with a minimum distance of 2. the only way to guarantee that no elements are duplicated is by using min.dist=length(unique(x))-1
otherwise there will be instances where at maximum r < min.dist
elements are not min.dist
distanced from their last occurrences, and if such elements exist they're going to be in the length(x) + 1 - 1:min.dist
subset of the resulting vector.
Just to be completely certain using a loop to check whether tail of the output vector has unique values: (remove the print statement I used it just for demonstration purposes)
shuffler <- function(x, min.dist=2){
while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
l
}
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"
A B C D E
3 3 3 3 3
table(print(shuffler(rep(LETTERS[1:5], 3),2)))
[1] "D" "C" "C"
[1] "C" "C" "E"
[1] "C" "A" "C"
[1] "D" "B" "D"
[1] "B" "E" "D"
[1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"
A B C D E
3 3 3 3 3
shuffler <- function(x, min.dist=2){
while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
l
}
this new version does a rigorous test on whether the elements in the tail of the vector are min.dist
anced, the previous version works for min.dist=2
, however this new version does better testing.