Replicate Stratified Random Sampling without Replacement in R


I'm struggling to create a vectorized, functional solution that lets me repeat stratified random sampling without replacement over many iterations. I can sample without replacement once, remove those rows from the dataset, and then repeat the process on the unsampled observations. Unfortunately, I need to do this many times, which makes the manual approach impractical.

I've tried the replicate() function, but it only samples without replacement within each pass: the chosen rows go back into the dataset before the next draw.

Using the code below, I'd like the function to create 30 new datasets, each composed of 3 unique (previously unsampled) rows from each of the "one" and "zero" sets. Each new dataset would therefore have 6 observations (three 1's and three 0's) and a unique name (e.g. "new_dat1", "new_dat2", ..., "new_dat30").

If possible, I'm looking to achieve all of this without using for loops, so something in the "apply" family is preferred.

set.seed(123)
dat <- data.frame(Outcome = round(runif(160, 0, 1)))
cust <- data.frame(Cust = rep(c("ABC", "DEF", "GHI"), c(45, 80, 35)))
dat <- cbind(cust, dat)

one <- subset(dat, Outcome == 1)
zero <- subset(dat, Outcome == 0)


# Manual option which is not sufficient    
################################################
# sample 1's and remove chosen obs from "one" dataset
set.seed(123)
index <- sample(1:nrow(one), 3, replace = FALSE)
new_dat1 <- one[index, ]
unused_one <- one[-index, ]

# sample 0's and remove chosen obs from "zero" dataset
set.seed(123)
index <- sample(1:nrow(zero), 3, replace = FALSE)
unused_zero <- zero[-index, ]

# combine the 3-1 and 3-0 samples into the first of 30 "new_datn" sets
new_dat1 <- rbind(new_dat1, zero[index, ])

# repeat, now sampling from "unused_one" and "unused_zero" to create "new_dat2" - "new_dat30"
################################################


# Failed attempt using the replicate() function
################################################
set.seed(123)
one_sample <- replicate(30, one[sample(nrow(one), 3, replace = FALSE), ], simplify = FALSE)
zero_sample <- replicate(30, zero[sample(nrow(zero), 3, replace = FALSE), ], simplify = FALSE)
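
A quick check makes the problem with this attempt visible. This is only a sketch that rebuilds the "one" set from the code above and counts how often the same row is drawn in more than one pass; `picked` is a name introduced here for illustration.

```r
set.seed(123)
dat <- data.frame(
  Cust    = rep(c("ABC", "DEF", "GHI"), c(45, 80, 35)),
  Outcome = round(runif(160, 0, 1))
)
one <- subset(dat, Outcome == 1)

one_sample <- replicate(30, one[sample(nrow(one), 3, replace = FALSE), ], simplify = FALSE)

# Row names of `one` are the original row numbers in `dat`,
# so a repeated row name means the same observation was drawn in two passes.
picked <- unlist(lapply(one_sample, rownames))
length(picked)             # 90 (30 draws x 3 rows)
anyDuplicated(picked) > 0  # TRUE whenever some row was reused across passes
```

Each individual draw is free of duplicates, but nothing stops a row from reappearing in a later draw.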

To complicate matters, the total number of 0 and 1 observations in the "dat" set will vary from time to time, so I'll likely always have remainders to deal with. The function must therefore sample 3 rows for each "new_dat" until only a remainder is left for the final set; that remainder can go into the final "new_dat" whatever its size.

Even if I could figure out how to solve the sampling issue in a vectorized function, I would still be at a loss as to how to have the function create the new datasets and name them appropriately.

I would be very grateful if anyone could provide me with some assistance. Thank you for taking the time to read through my post.


Solution

  • If I understood what you want correctly, here is one solution.

    First, sample the whole vector of row numbers, that is, randomly shuffle them:

    sample_rows  <- sample(nrow(one))
    

    Then assign a sample group to each of the shuffled rows (3 elements per group). Since the number of rows may not be divisible by 3, extend the group vector to the same length as the number of rows, which pads it with NAs. Then fill the NAs with the next group number (I think that is what you meant by "remainder for the final set"):

    sample_group <- rep(1:(length(sample_rows) %/% 3), each = 3)
    length(sample_group) <- length(sample_rows)
    sample_group[is.na(sample_group)] <- max(sample_group, na.rm = TRUE) + 1
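
As a side note, the same group ids can be produced in one step with ceiling(). The sketch below compares the two constructions; it uses sample(74) as a stand-in for sample(nrow(one)) (74 is the row count that appears in the output of this answer), and `g1` and `g2` are names introduced here for illustration.

```r
set.seed(123)
sample_rows <- sample(74)  # assumption: stand-in for sample(nrow(one))

# Three-step construction from the answer
g1 <- rep(1:(length(sample_rows) %/% 3), each = 3)
length(g1) <- length(sample_rows)            # pads with NA
g1[is.na(g1)] <- max(g1, na.rm = TRUE) + 1   # remainder becomes the next group

# One-step alternative: positions 1-3 -> group 1, 4-6 -> group 2, ...
g2 <- ceiling(seq_along(sample_rows) / 3)

all(g1 == g2)   # TRUE
table(g2)       # size of each group
```

Either form works; the one-liner simply avoids the NA padding step.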
    

    So now you have 24 samples of 3 and 1 sample of 2, all drawn without replacement:

    samples <- data.frame(sample_rows, sample_group)
    
    head(samples)
      sample_rows sample_group
    1          12            1
    2           6            1
    3          41            1
    4          35            2
    5          71            2
    6          62            2
    
    tail(samples)
       sample_rows sample_group
    69          69           23
    70          53           24
    71          32           24
    72          27           24
    73          18           25
    74          65           25
    

    I did this for the rows of "one", but you can easily repeat it for the rows of "zero" and combine the results.
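
One way to repeat the steps for both sets and combine them is sketched below. It is not part of the original answer: `chunk_rows` is a hypothetical helper, `combined` and `n_sets` are names introduced here, and the grouping uses ceiling(), which gives the same group ids as the rep() steps above. The sketch assumes the two sets may yield different numbers of chunks, so it pairs them by position and keeps whatever is left over.

```r
set.seed(123)
dat <- data.frame(
  Cust    = rep(c("ABC", "DEF", "GHI"), c(45, 80, 35)),
  Outcome = round(runif(160, 0, 1))
)
one  <- subset(dat, Outcome == 1)
zero <- subset(dat, Outcome == 0)

# Hypothetical helper: shuffle the rows of `df` and return a list of
# data frames with `size` rows each (the last one holds the remainder).
chunk_rows <- function(df, size = 3) {
  shuffled <- sample(nrow(df))
  group    <- ceiling(seq_along(shuffled) / size)
  lapply(split(shuffled, group), function(i) df[i, , drop = FALSE])
}

one_chunks  <- chunk_rows(one)
zero_chunks <- chunk_rows(zero)

# Pair chunk k of the 1's with chunk k of the 0's; when one list is shorter,
# the missing chunk is NULL and rbind() simply keeps the other one.
n_sets <- max(length(one_chunks), length(zero_chunks))
combined <- lapply(seq_len(n_sets), function(k) {
  rbind(
    if (k <= length(one_chunks))  one_chunks[[k]],
    if (k <= length(zero_chunks)) zero_chunks[[k]]
  )
})

combined[[1]]   # first combined set: up to three 1's and three 0's
```

If the two sets produce different numbers of chunks, the trailing elements of `combined` contain rows from the larger set only.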

    PS: You can extract the rows from the data.frame using split() and lapply(). For example:

    new_dat <- lapply(split(samples$sample_rows, samples$sample_group), function(x) one[x,])
    

    So new_dat is a list with all 25 data.frames. For example:

    new_dat[[8]] # gives you the eighth data.frame
    

    Or:

    new_dat[[25]] # gives you the last data.frame
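
Since the question asks for names such as "new_dat1", one option is to name the list elements rather than create 30 separate objects. The sketch below rebuilds the list for the "one" rows and renames its elements; the list2env() step is optional, and `sets` is a name introduced here for illustration.

```r
set.seed(123)
dat <- data.frame(
  Cust    = rep(c("ABC", "DEF", "GHI"), c(45, 80, 35)),
  Outcome = round(runif(160, 0, 1))
)
one <- subset(dat, Outcome == 1)

# Shuffle the rows and assign group ids 1,1,1,2,2,2,... (last group takes the remainder)
sample_rows  <- sample(nrow(one))
sample_group <- ceiling(seq_along(sample_rows) / 3)
new_dat <- lapply(split(sample_rows, sample_group), function(x) one[x, ])

# split() names the elements "1", "2", ...; prefix them to get "new_dat1", "new_dat2", ...
names(new_dat) <- paste0("new_dat", names(new_dat))

head(names(new_dat))
new_dat$new_dat1          # access by name

# Optional: copy the elements into a dedicated environment as separate objects,
# which avoids cluttering the global workspace.
sets <- list2env(new_dat, envir = new.env())
exists("new_dat1", envir = sets, inherits = FALSE)
```

Keeping the data frames in a named list usually makes later steps easier, because the whole collection can still be passed to lapply().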