Search code examples
rrandomdistancesample

Shuffle Vector in R, But Identical Elements Should Have Minimum Distance


I want to randomize/shuffle a vector. Some of the vector elements are identical. After shuffling, identical elements should have a minimum distance of three (i.e. two other elements should be between identical elements).

Consider the following example vector in R:

x <- rep(LETTERS[1:5], 3)  # Create example vector
x
#  [1] "A" "B" "C" "D" "E" "A" "B" "C" "D" "E" "A" "B" "C" "D" "E"

If I shuffle my vector using the sample function, some of the identical elements may be too close together. For instance, if I use the following R code, the element "C" appears directly after each other at positions 5 and 6:

set.seed(53135)
sample(x)                  # sample() function puts same elements too close
#  [1] "B" "A" "E" "D" "C" "C" "E" "A" "B" "C" "D" "E" "A" "D" "B"

How could I ensure that identical elements have a minimum distance of three?


Solution

  • So basically we need to conditionally sample one element from the x vector that have not been selected in the min.dist-1 runs. Using purrr's reduce we can achieve this:

    min.dist <- 2
    reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
    
    [1] "A" "E" "D" "B" "A" "D" "E" "C" "D" "A" "C" "E" "B" "A" "E"
    

    Bundled in a function

    shuffle <- function(x, min.dist=2){
        stopifnot(min.dist < length(unique(x)))
        reduce(integer(length(x)-1), ~c(.x, sample(x[!x %in% tail(.x, min.dist)], 1)), .init=sample(x,1))
    }
    
    > shuffle(x, 3)
     [1] "A" "C" "B" "D" "E" "A" "B" "C" "E" "D" "A" "B" "C" "E" "A"
    > shuffle(x, 3)
     [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "D" "E" "C" "A"
    > shuffle(x, 4)
     [1] "C" "E" "D" "A" "B" "C" "E" "D" "A" "B" "C" "E" "D" "A" "B"
    > shuffle(x, 4)
     [1] "A" "B" "D" "E" "C" "A" "B" "D" "E" "C" "A" "B" "D" "E" "C"
    > shuffle(x, 2)
     [1] "E" "A" "D" "E" "B" "D" "A" "E" "C" "D" "A" "E" "C" "A" "B"
    > shuffle(x, 2)
     [1] "B" "A" "D" "C" "B" "A" "E" "B" "A" "E" "B" "C" "D" "A" "E"
    

    after @27ϕ9 comment:

    shuffle <- function(x, min.dist=2){
        stopifnot(min.dist < length(unique(x)))
        reduce(integer(length(x)-1), ~ c(.x, sample(x[!x %in% tail(.x, min.dist) &( x %in% names(t <- table(x[x%in%.x]) > table(.x))[t] | !x %in% .x)], 1)), .init=sample(x,1))
    }
    > table(shuffle(rep(LETTERS[1:5], 3),2))
    
    A B C D E 
    3 3 3 3 3 
    > table(shuffle(rep(LETTERS[1:5], 3),2))
    Error in sample.int(length(x), size, replace, prob) : 
      invalid first argument
    

    UPDATE

    After some trial and error, looking at the fact that not always you're gonna have enough elements to space out the min.dist I came up with a solution this code is the most explained from the ones above :

    shuffle <- function(x, min.dist=2){
        stopifnot(min.dist < length(unique(x)))
        reduce(integer(length(x)-1), function(.x, ...){
            # whether the value is in the tail of the aggregated vector
            in.tail <- x %in% tail(.x, min.dist)
            # whether a value still hasn't reached the max frequency
            freq.got <- x %in% names(t<-table(x[x%in%.x]) > table(.x))[t]
            # whether a value isn't in the aggregated vector
            yet <- !x %in% .x
            # the if is there basically to account for the cases when we don't have enough vars to space out the vectors
             c(.x, if(any((!in.tail & freq.got) | yet )) sample(x[(!in.tail & freq.got) | yet ], 1) else  x[which(freq.got)[1]] )
        }, .init=sample(x,1))
    }
    

    now running the table(shuffle(rep(LETTERS[1:5], 3),2)) will always return 3 for all vars and we can say with some certainty that in the vector the variables are spaced with a minimum distance of 2. the only way to guarantee that no elements are duplicated is by using min.dist=length(unique(x))-1 otherwise there will be instances where at maximum r < min.dist elements are not min.dist distanced from their last occurrences, and if such elements exist they're going to be in the length(x) + 1 - 1:min.dist subset of the resulting vector.

    Just to be completely certain using a loop to check whether tail of the output vector has unique values: (remove the print statement I used it just for demonstration purposes)

    shuffler <- function(x, min.dist=2){
        while(!length(unique(print(tail(l<-shuffle(x, min.dist=min.dist), min.dist+1))))==min.dist+1){}
        l
    }
    
    table(print(shuffler(rep(LETTERS[1:5], 3),2)))
     [1] "A" "B" "C" "E" "B" "C" "D" "A" "C" "D" "A" "E" "B" "D" "E"
    
    A B C D E 
    3 3 3 3 3 
    
    table(print(shuffler(rep(LETTERS[1:5], 3),2)))
    [1] "D" "C" "C"
    [1] "C" "C" "E"
    [1] "C" "A" "C"
    [1] "D" "B" "D"
    [1] "B" "E" "D"
     [1] "C" "A" "E" "D" "A" "B" "C" "E" "A" "B" "D" "C" "B" "E" "D"
    
    A B C D E 
    3 3 3 3 3 
    

    Update:

    shuffler <- function(x, min.dist=2){
        while(any(unlist(lapply(unique(tl<-tail(l<-shuffle(x, min.dist=min.dist), 2*min.dist)), function(x) diff(which(tl==x))<=min.dist)))){}
        l
    }
    

    this new version does a rigorous test on whether the elements in the tail of the vector are min.distanced, the previous version works for min.dist=2, however this new version does better testing.