How can I draw a weighted random sample from a dataset in R, using a different sampling proportion for each level of a factor variable?


I want to draw a random sample from my dataset, using a different proportion for each level of a factor variable and sampling weights stored in another column. A dplyr solution using pipes is preferred, since it can be dropped easily into a longer pipeline.

Let's take the iris dataset as an example. The Species column has three levels with 50 rows each. Let's also assume the sampling weights are stored in the Sepal.Length column. If I sample the same proportion (or the same number of rows) from every species, the problem is easy to solve:

library(tidyverse)

iris %>% group_by(Species) %>% slice_sample(prop = 0.1, weight_by = Sepal.Length)

# A tibble: 15 x 5
# Groups:   Species [3]
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
 1          5.4         3.7          1.5         0.2 setosa    
 2          5.3         3.7          1.5         0.2 setosa    
 3          5.7         4.4          1.5         0.4 setosa    
 4          5           3.5          1.6         0.6 setosa    
 5          4.8         3.1          1.6         0.2 setosa    
 6          6.1         2.9          4.7         1.4 versicolor
 7          6.7         3.1          4.7         1.5 versicolor
 8          5           2            3.5         1   versicolor
 9          7           3.2          4.7         1.4 versicolor
10          5.7         2.9          4.2         1.3 versicolor
11          7.2         3.2          6           1.8 virginica 
12          6.7         2.5          5.8         1.8 virginica 
13          6.4         2.8          5.6         2.1 virginica 
14          6.3         3.3          6           2.5 virginica 
15          7.2         3            5.8         1.6 virginica 

But I get stuck when I need to sample a different proportion for each species, say 10%, 20%, and 25% respectively.

iris %>% group_by(Species) %>% slice_sample(prop = c(0.1, 0.2, 0.25), weight_by = Sepal.Length)

#Error: `prop` must be a single number

OR

iris %>% group_split(Species) %>% map_df(c(0.1, 0.2, 0.25), ~ slice_sample(prop = ., weight_by = Sepal.Length))
# A tibble: 0 x 0

Please help


Solution

  • If I understand you correctly, you can split the data by species and pair each piece with its own proportion using map2:

    iris %>% 
      group_split(Species) %>% 
      map2(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))
    
    [[1]]
    # A tibble: 5 x 5
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
             <dbl>       <dbl>        <dbl>       <dbl> <fct>  
    1          4.9         3            1.4         0.2 setosa 
    2          4.8         3            1.4         0.1 setosa 
    3          5.2         4.1          1.5         0.1 setosa 
    4          5           3.5          1.6         0.6 setosa 
    5          5.2         3.5          1.5         0.2 setosa 
    
    [[2]]
    # A tibble: 10 x 5
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
              <dbl>       <dbl>        <dbl>       <dbl> <fct>     
     1          6.3         2.5          4.9         1.5 versicolor
     2          5.5         2.6          4.4         1.2 versicolor
     3          6.9         3.1          4.9         1.5 versicolor
     4          6.6         2.9          4.6         1.3 versicolor
     5          6.1         3            4.6         1.4 versicolor
     6          5.7         2.8          4.5         1.3 versicolor
     7          6.7         3.1          4.4         1.4 versicolor
     8          5.1         2.5          3           1.1 versicolor
     9          5.7         3            4.2         1.2 versicolor
    10          7           3.2          4.7         1.4 versicolor
    
    [[3]]
    # A tibble: 12 x 5
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species  
              <dbl>       <dbl>        <dbl>       <dbl> <fct>    
     1          6.4         3.2          5.3         2.3 virginica
     2          7.2         3.2          6           1.8 virginica
     3          6.3         3.3          6           2.5 virginica
     4          6.2         2.8          4.8         1.8 virginica
     5          7.6         3            6.6         2.1 virginica
     6          5.7         2.5          5           2   virginica
     7          4.9         2.5          4.5         1.7 virginica
     8          6.7         3.1          5.6         2.4 virginica
     9          7.7         2.8          6.7         2   virginica
    10          6.7         3.3          5.7         2.5 virginica
    11          6           3            4.8         1.8 virginica
    12          5.6         2.8          4.9         2   virginica
    

    To get a single data frame back instead of a list, change map2 to map2_df, which row-binds the results:

    iris %>% 
      group_split(Species) %>% 
      map2_df(c(0.1, 0.2, 0.25), ~ slice_sample(.x, prop = .y))
    
    # A tibble: 27 x 5
       Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
              <dbl>       <dbl>        <dbl>       <dbl> <fct>     
     1          5.7         3.8          1.7         0.3 setosa    
     2          4.8         3.1          1.6         0.2 setosa    
     3          5.1         3.8          1.5         0.3 setosa    
     4          4.9         3.6          1.4         0.1 setosa    
     5          4.8         3.4          1.6         0.2 setosa    
     6          5.7         2.8          4.1         1.3 versicolor
     7          6.6         3            4.4         1.4 versicolor
     8          6.8         2.8          4.8         1.4 versicolor
     9          5.8         2.7          4.1         1   versicolor
    10          6.4         3.2          4.5         1.5 versicolor
    # ... with 17 more rows
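
The calls above leave out the weight_by argument from the question and rely on the proportions being listed in the same order as the factor levels. As a rough sketch of one alternative (not the answerer's method), the proportions can be stored in a named vector and looked up per group inside group_modify, with the weights passed through to slice_sample. The names props and sampled are made up for illustration, and the seed is only there to make the draw repeatable.

```r
library(dplyr)

# Hypothetical lookup table: one sampling proportion per Species level
props <- c(setosa = 0.1, versicolor = 0.2, virginica = 0.25)

set.seed(42)  # assumption: fixed seed so the sample can be reproduced

sampled <- iris %>%
  group_by(Species) %>%
  # .x is the rows of the current group, .y is a one-row tibble holding its key
  group_modify(~ slice_sample(
    .x,
    prop = props[[as.character(.y$Species)]],  # look up this group's proportion by name
    weight_by = Sepal.Length                   # sampling weights, as in the question
  )) %>%
  ungroup()

# Rows drawn per species
count(sampled, Species)
```

With 50 rows per species, slice_sample rounds each group size down, so this should give 5, 10, and 12 rows (27 in total), the same sizes as in the output above. Looking proportions up by name means the result does not depend on the order in which the groups are processed.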