
Add a dummy column flagging whether each row was randomly selected


Suppose I have the following data set (named data).

id var1 var2
1   A   33
2   B   23
3   A   45
4   A   55
5   B   22
6   A   33
7   B   90
8   A   78
9   B   12
10  A   11

I want to add a new column to the original data set that indicates whether each row was randomly selected (1) or not (0). I tried the following.

library(sampling)
data1 <- strata(data,"var1", size=c(4,3),method="srswor") #stratified random sampling
data2 <- getdata(data,data1)  # this gives a separate data set

Any help, please? Thanks!


Solution

  • If you look in the documentation of sampling::strata() you'll find the following information:

    The function produces an object, which contains the following information:
    
    ID_unit 
    the identifier of the selected units.
    
    Stratum 
    the unit stratum.
    
    Prob    
    the unit inclusion probability.
    

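    ID_unit holds the row numbers of the selected units, so it can be used to index the original data and set the flag you asked for (a logical here; wrap it in as.integer() if you need 1/0):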

    data <- structure(
      list(
        id   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
        var1 = c("A", "B", "A", "A", "B", "A", "B", "A", "B", "A"),
        var2 = c(33, 23, 45, 55, 22, 33, 90, 78, 12, 11)
      ),
      row.names = c(NA, -10L),
      class = c("tbl_df", "tbl", "data.frame")
    )
    
    
    library(sampling)
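    # stratified random sampling without replacement;
    # sizes follow the order in which strata appear in the data (4 from A, 3 from B)
    data1 <- strata(data, "var1", size = c(4, 3), method = "srswor")
    data2 <- getdata(data, data1)  # a separate data set; not needed for the flag
    
    # start with every row unselected, then mark the sampled row numbers
    data$sampled <- FALSE
    data[data1$ID_unit, "sampled"] <- TRUE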
    data
    #>    id var1 var2 sampled
    #> 1   1    A   33   FALSE
    #> 2   2    B   23    TRUE
    #> 3   3    A   45   FALSE
    #> 4   4    A   55    TRUE
    #> 5   5    B   22   FALSE
    #> 6   6    A   33    TRUE
    #> 7   7    B   90    TRUE
    #> 8   8    A   78    TRUE
    #> 9   9    B   12    TRUE
    #> 10 10    A   11    TRUE
    

    Created on 2020-07-28 by the reprex package (v0.3.0)
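
Since the question asks for a 1/0 column rather than TRUE/FALSE, here is a minimal sketch of the same idea with an integer flag. It assumes a plain data.frame version of the example data; the seed value and the object name s are illustrative choices, not part of the original answer.

```r
library(sampling)

# Example data from the question, rebuilt as a plain data.frame
data <- data.frame(
  id   = 1:10,
  var1 = c("A", "B", "A", "A", "B", "A", "B", "A", "B", "A"),
  var2 = c(33, 23, 45, 55, 22, 33, 90, 78, 12, 11)
)

set.seed(42)  # assumption: arbitrary seed, used only to make the draw repeatable

# Stratified sampling without replacement: 4 rows from A, 3 rows from B
s <- strata(data, "var1", size = c(4, 3), method = "srswor")

# 1 if the row number appears in ID_unit, 0 otherwise
data$sampled <- as.integer(seq_len(nrow(data)) %in% s$ID_unit)

# Sanity check: number of selected and unselected rows per stratum
table(data$var1, data$sampled)
```

Which rows get flagged depends on the seed, but the table should always show 4 selected rows in stratum A and 3 in stratum B, because the sample sizes are fixed per stratum.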