
Multistage sampling in R when only the final sample size is given


I am trying to implement an algorithm for sampling in several stages where only the final size of the sample is known.

Here is an example of the structure of my sampling frame, where:

  • cluster is a block of households.
  • total_households is the number of households in each block.
  • group is a grouping of blocks based on the number of households in the blocks.
  • Probability is the probability of selecting a group.

Given a sample size $n$, the algorithm has the following steps:

  1. Select one group with unequal probabilities, with replacement.
  2. Select one cluster within the group chosen in the previous step by simple random sampling without replacement, and remove it from the sampling frame.
  3. In the selected cluster, select only 25% of the households.
  4. Repeat until the exact sample size is reached.

Example rows of the sampling frame:

       cluster total_households group  Probability
    1   173494               13     2 4.055410e-01
    2   173495               19     5 4.176953e-02
    3   173496               22     5 4.176953e-02
    4   173497               21     5 4.176953e-02
    5   173498               18     5 4.176953e-02
    6   173499               27     7 6.775638e-05
    7   173500               15     4 5.020529e-01
    8   173501               19     5 4.176953e-02

I want to implement this algorithm in R. I know there is a package for this called sampling, with a multistage sampling function (mstage), but it does not work for my case because I must specify the number of clusters and groups before running the algorithm. My programming skills are limited. I've been trying to do something with a while loop, but I think I'm far from the correct result.

    require(dplyr) # to use pipes in the code 

    n_sample = 844
    group = NULL
    total = NULL
    cluster = NULL
    total_households = NULL
    total = 0
    i = 1
    while(total < n_sample){
    group[i] = groups[sample(nrow(groups),size = 1,prob = groups$P),c("group")]
    total_households = data[data$group==group[i],] %>% 
                          sample_n(size=1) %>% 
                                select(total_households)
    cluster[i] = data[data$group==group[i],] %>%
                        sample_n(size=1) %>% 
                        select(cluster) %>% as.numeric() 
    data = data[data$cluster!=cluster[i],] 
    total = total+total_households
    i = i+1
    }

Solution

  • You are pretty close to what you want to achieve (leaving aside the tidiness of code and focusing on numbers):

    First, let's correct the while loop (3 modifications):

    while(total < n_sample){
    group[i] = groups[sample(nrow(groups),size = 1,prob = groups$P),c("group")]
    # Mod_3: draw the cluster row once, so the household count added to
    # total belongs to the same cluster that is removed from the frame
    picked = data[data$group==group[i],] %>% 
                          sample_n(size=1)                                   # Mod_3
    total_households = as.numeric(picked$total_households)                   # Mod_1: plain number, not a data frame
    cluster[i] = as.numeric(picked$cluster)
    data = data[data$cluster!=cluster[i],] 
    total = total + (total_households*0.25)                                  # Mod_2: only 25% of households are taken
    i = i+1
    }
    

    Note that you will usually end up with total > n, but you can always adjust it to equal n exactly by reducing the number of households taken from the last cluster in the list.
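The adjustment of the last cluster could look something like the sketch below. The helper name trim_to_n and the vector takes are hypothetical (they do not appear in the question), and the example assumes the 25% share of each cluster is rounded up to a whole number of households, which the question does not specify.

```r
# Hypothetical helper: 'takes' holds the number of households taken from
# each selected cluster, in the order the clusters were drawn.
trim_to_n <- function(takes, n) {
  overshoot <- sum(takes) - n
  last <- length(takes)
  # The loop stops as soon as the total reaches n, so the overshoot is
  # always smaller than the size of the last take.
  stopifnot(overshoot >= 0, overshoot < takes[last])
  takes[last] <- takes[last] - overshoot
  takes
}

# Assumed rounding: ceiling(0.25 * total_households) for four clusters
takes <- ceiling(0.25 * c(13, 19, 22, 18))
trimmed <- trim_to_n(takes, 18)
sum(trimmed)
```

Only the last cluster is touched, so every earlier cluster keeps its full 25% share.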

    Second, an important point to take into consideration: the probabilities of the groups should sum to 1 throughout the algorithm, so if a group runs out of clusters and is dropped, the probabilities of the remaining groups need to be rescaled.
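Putting both points together, here is a minimal base R sketch of the whole procedure. It is one possible reading of the algorithm, not a tested survey-sampling implementation. The function name draw_sample, the argument names, and the column households_sampled are made up for illustration, and rounding the 25% share up with ceiling() is an assumption. The sketch relies on sample.int() treating prob as weights that do not have to sum to one, so dropping an exhausted group rescales the remaining probabilities implicitly.

```r
# Sketch only. 'frame' needs columns cluster, total_households, group;
# 'groups' needs columns group and P (selection weight of each group).
draw_sample <- function(frame, groups, n, frac = 0.25) {
  picked <- frame[0, ]   # selected clusters, in draw order
  take <- numeric(0)     # households taken from each selected cluster
  while (sum(take) < n) {
    # keep only groups that still have clusters left in the frame
    avail <- groups[groups$group %in% frame$group, ]
    if (nrow(avail) == 0) stop("sampling frame exhausted before reaching n")
    g <- avail$group[sample.int(nrow(avail), size = 1, prob = avail$P)]
    idx <- which(frame$group == g)
    row <- idx[sample.int(length(idx), size = 1)]  # one draw per cluster
    picked <- rbind(picked, frame[row, ])
    take <- c(take, ceiling(frac * frame$total_households[row]))  # assumed rounding
    frame <- frame[-row, ]  # without replacement: drop the cluster
  }
  # reduce the last take so that the total equals n exactly
  take[length(take)] <- take[length(take)] - (sum(take) - n)
  picked$households_sampled <- take
  picked
}

# The eight rows shown in the question, with one weight per group
frame <- data.frame(
  cluster = 173494:173501,
  total_households = c(13, 19, 22, 21, 18, 27, 15, 19),
  group = c(2, 5, 5, 5, 5, 7, 4, 5)
)
groups <- data.frame(
  group = c(2, 4, 5, 7),
  P = c(4.055410e-01, 5.020529e-01, 4.176953e-02, 6.775638e-05)
)
set.seed(1)
s <- draw_sample(frame, groups, n = 12)
sum(s$households_sampled)
```

With the real frame you would pass the full data and n = 844. The function stops with an error if the frame runs out of clusters before n is reached.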