
Sampling from a data.frame while controlling for a proportion [stratified sampling]


I have the following dataset:

id1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20)
status <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)
df <- data.frame(id1, status)

In df, status is 2 for 40% of the observations (8 of 20). I am looking for a function that draws a sample of 10 observations from df while preserving that proportion.

I have already seen the question "stratified random sampling from data frame in R", but it does not address proportions.


Solution

  • You can try the stratified function from my "splitstackshape" package:

    library(splitstackshape)
    stratified(df, "status", 10/nrow(df))
    #     id1 status
    #  1:   5      1
    #  2:  12      1
    #  3:   2      1
    #  4:   1      1
    #  5:   6      1
    #  6:   9      1
    #  7:  16      2
    #  8:  17      2
    #  9:  18      2
    # 10:  15      2
    

    Alternatively, using sample_frac from "dplyr":

    library(dplyr)
    
    df %>%
      group_by(status) %>%
      sample_frac(10/nrow(df))
    

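    Both approaches sample the same fraction of rows from each group, so the result keeps the proportions of the original grouping variable (hence the use of 10/nrow(df), or, equivalently, 0.5). With this data, that gives 6 rows with status 1 and 4 rows with status 2.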
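If you would rather not add a package, the same idea can be sketched in base R: split the row indices by group, draw the same fraction from each group, and recombine. The helper name stratified_base is made up for this illustration and is not part of any package; treat it as a minimal sketch rather than a replacement for the functions above.

```r
# Example data from the question: 12 rows with status 1, 8 with status 2
df <- data.frame(id1 = 1:20, status = rep(c(1, 2), times = c(12, 8)))

# Hypothetical helper (not from any package): sample the same fraction
# of rows, without replacement, within each level of `group`
stratified_base <- function(data, group, frac) {
  # Row indices of `data`, split by the levels of the grouping column
  idx_by_group <- split(seq_len(nrow(data)), data[[group]])
  picked <- lapply(idx_by_group, function(idx) {
    n <- round(length(idx) * frac)
    # Index into idx instead of calling sample(idx, n), because
    # sample() treats a single number x as the range 1:x
    idx[sample.int(length(idx), n)]
  })
  data[unlist(picked, use.names = FALSE), , drop = FALSE]
}

set.seed(1)  # arbitrary seed, only to make the draw reproducible
smp <- stratified_base(df, "status", 10 / nrow(df))
table(smp$status)  # 6 rows with status 1, 4 rows with status 2
```

Because the group sizes are rounded separately, the total can differ slightly from the requested sample size when the group counts do not divide evenly; here 12 * 0.5 and 8 * 0.5 are whole numbers, so exactly 10 rows come back. Also note that in dplyr 1.0.0 and later, sample_frac() is superseded by slice_sample(prop = ...), which can be used the same way after group_by().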