r sampling

Using a sample list as a template for sampling from a larger list without wraparound


If I have a vector of letters:

> all <- letters
> all
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"

and then I define a reference sample from letters as follows:

> refSample <- c("j","l","m","s")

in which the spacing between elements is 2 (1st to 2nd), 1 (2nd to 3rd) and 6 (3rd to 4th), how can I then select n samples from all whose elements have the same spacing as refSample, without wrapping around? For example, "a","c","d","j" and "q","s","t","z" would be valid samples, but "a","c","d","k" and "r","t","u","a" would not. The former has an index difference of 7 (rather than 6) between the 3rd and last element, whereas the latter has the correct spacing but wraps around.

Second, how can I parameterise this, so that the spacing of whatever refSample is supplied is used as the template?
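
To make the spacing rule concrete, here is a minimal sketch of how it could be computed and checked with match() and diff(). The helper name has_same_spacing is made up for illustration, and the sketch assumes every letter of a candidate occurs exactly once in the full vector.

```r
all <- letters
refSample <- c("j", "l", "m", "s")

idx <- match(refSample, all)   # positions of the reference letters: 10 12 13 19
spacing <- diff(idx)           # gaps between consecutive positions: 2 1 6

# Hypothetical helper: TRUE when candidate has the same gaps as ref.
# A wrapped-around candidate produces a negative gap, so it fails the check.
has_same_spacing <- function(candidate, ref, full) {
  identical(diff(match(candidate, full)), diff(match(ref, full)))
}

has_same_spacing(c("a", "c", "d", "j"), refSample, all)  # TRUE
has_same_spacing(c("q", "s", "t", "z"), refSample, all)  # TRUE
has_same_spacing(c("a", "c", "d", "k"), refSample, all)  # FALSE (last gap is 7)
has_same_spacing(c("r", "t", "u", "a"), refSample, all)  # FALSE (wraps around)
```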


Solution

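  • Here's a simple way: take the gaps between the positions of the reference elements, draw random start positions that leave room for the whole pattern, and rebuild each sample from its start with a cumulative sum --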

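    all <- letters
    refSample <- c("j","l","m","s")

    pick_matches <- function(n, ref, full) {
      iref <- match(ref, full)                # positions of ref within full
      stopifnot(!anyNA(iref))                 # every ref element must occur in full
      spaces <- diff(iref)                    # gaps between consecutive positions
      tot_space <- sum(spaces)                # distance from first to last element
      max_start <- length(full) - tot_space   # last start that does not run off the end
      stopifnot(max_start >= 1)               # the pattern must fit at least once
      # sample.int() draws from 1..max_start and avoids the 1:0 trap of 1:max_start
      starts <- sample.int(max_start, n, replace = TRUE)
      # each column of the result is one sample
      sapply(starts, function(s) full[cumsum(c(s, spaces))])
    }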
    
    > set.seed(1)
    > pick_matches(5, refSample, all) # each COLUMN is a desired sample vector
         [,1] [,2] [,3] [,4] [,5]
    [1,] "e"  "g"  "j"  "p"  "d"
    [2,] "g"  "i"  "l"  "r"  "f"
    [3,] "h"  "j"  "m"  "s"  "g"
    [4,] "n"  "p"  "s"  "y"  "m"
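
    Because starts are drawn with replacement, the same sample can come up more than once. If distinct samples are wanted, one possible variant is to enumerate every valid start position and then draw columns without replacement. The sketch below is not part of the answer above: the name all_matches is hypothetical, and it assumes ref is given in increasing order of position and has at least two elements.

```r
# Hypothetical helper: every non-wrapping sample with the same spacing as ref.
all_matches <- function(ref, full) {
  iref <- match(ref, full)
  stopifnot(!anyNA(iref))                   # every ref element must occur in full
  offsets <- iref - iref[1]                 # 0 2 3 9 for c("j","l","m","s")
  max_start <- length(full) - max(offsets)  # last start that still fits
  stopifnot(max_start >= 1)
  # one column per valid start position
  sapply(seq_len(max_start), function(s) full[s + offsets])
}

m <- all_matches(c("j", "l", "m", "s"), letters)
dim(m)    # 4 17
m[, 1]    # "a" "c" "d" "j"
m[, 17]   # "q" "s" "t" "z"

# Draw 5 distinct samples (columns) without replacement.
picked <- m[, sample.int(ncol(m), 5), drop = FALSE]
```

    Enumerating is cheap here because there are at most length(full) candidate starts; asking for more distinct samples than ncol(m) makes sample.int() raise an error.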