Search code examples
rdplyrsample

Sample a proportion of each group, BUT with a minimum constraint (using dplyr)


I have a population of 6 categories (stratum) and I want in each stratum to take the 10% as a sample. Doing so I take:

var = c(rep("A",10),rep("B",10),rep("C",3),rep("D",5),"E","F");var
value = rnorm(30)
dat = tibble(var,value);
pop=dat%>%group_by(var)
pop
singleallocperce = slice_sample(pop, prop=0.1);
singleallocperce

with result:

# A tibble: 2 x 2
# Groups:   var [2]
  var   value
  <chr> <dbl>
1 A     -1.54
2 B     -1.12

But I want even if in some stratum that the polupation inside them cannot reach the taken sample of 10% to take at least one observation.How can I do it this in R using dplyr package?

Additional

Additionally if I want to make proportional allocation sampling (ie with weight proportional to the subpopulation of each stratum fro example for A the weight will be : 10/30,for B: 10/30,for C:3/30,D:5/30 etc ) keeping the constrain of 1 observation if the subpopulation does not meet that requirement ?


Solution

  • Possible approach (note the presence of 20 x A to check two are returned).

    library(tidyverse)
    
    # Data (note 20 As)
    var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
    value = rnorm(40)
    dat = tibble(var, value)
    
    # Possible approach
    dat %>%
      group_by(var) %>%
      mutate(min = if_else(n() * 0.1 >= 1, n() * 0.1, 1),
             random = sample(n())) %>%
      filter(random <= min) |> 
      select(var, value)
    #> # A tibble: 7 × 2
    #> # Groups:   var [6]
    #>   var     value
    #>   <chr>   <dbl>
    #> 1 A      0.0105
    #> 2 A      0.171 
    #> 3 B     -1.89  
    #> 4 C      1.89  
    #> 5 D      0.612 
    #> 6 E      0.516 
    #> 7 F      0.185
    

    Created on 2022-06-02 by the reprex package (v2.0.1)

    Weighted version:

    library(tidyverse)
    
    # Data (note 20 As)
    var = c(rep("A",20),rep("B",10),rep("C",3),rep("D",5),"E","F")
    value = rnorm(40)
    dat = tibble(var, value)
    
    # Possible approach
    dat %>%
      add_count(name = "n_all") %>%
      group_by(var) %>%
      mutate(
        weight = n() / n_all,
        min = if_else(n() * weight >= 1, n() * weight, 1),
        random = sample(n())
      ) %>%
      filter(random <= min) |>
      select(var, value)
    #> # A tibble: 16 × 2
    #> # Groups:   var [6]
    #>    var     value
    #>    <chr>   <dbl>
    #>  1 A      0.339 
    #>  2 A      1.77  
    #>  3 A     -0.145 
    #>  4 A     -0.915 
    #>  5 A      0.146 
    #>  6 A      0.896 
    #>  7 A     -0.407 
    #>  8 A     -1.30  
    #>  9 A      1.22  
    #> 10 A      0.0527
    #> 11 B     -0.602 
    #> 12 B     -0.432 
    #> 13 C     -0.0540
    #> 14 D     -1.45  
    #> 15 E      1.54  
    #> 16 F      0.879
    

    Created on 2022-06-09 by the reprex package (v2.0.1)