Randomizing 1s and 0s by group while specifying the proportion of 1s and 0s within groups


First, I want to create a column that randomizes 1s and 0s by group while keeping the same proportion of 1s and 0s as in another column.

Second, I want to repeat the above procedure many times (say 1000) and calculate the expected value.

Let me clarify with hypothetical data.

library(data.table) 

district <- c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)                                       
village <- c(1,2,3,4,1,2,3,4,5,1,2,3,4,5,6,7)                              
status <- c(1,0,1,0, 1,1,1,0,0,1,1,1,1,0,0,0) 

datei <- data.table(district, village, status) 

I want to create a column that randomizes 1s and 0s within each district while keeping the same proportion of 1s and 0s as in status; the ratios of 1s to 0s are 2:2, 3:2 and 4:3 in districts 1, 2 and 3 respectively.

Second, I also want to repeat this randomization many times (say 1000 times) and calculate the expected value for each row.

I know how to randomize 1s and 0s based on district.

datei[, random_status := sample(c(1,0), .N, replace=TRUE), keyby = district]

However, I do not know how to keep the same proportion of 1s and 0s as in status, or how to repeat the randomization and calculate the expected value for each row.

Many thanks.

Edit: Let me add what I expect when calculating the expected value for each row after, say, 1000 repetitions. The column exp_status is generated by randomizing many times while keeping the proportion of 1s to 0s within each district the same as in status.

district village status exp_status
       1       1      1        0.9
       1       2      0        0.7
       1       3      1        0.8
       1       4      0        0.1
       2       1      1        0.2
       2       2      1        0.3
       2       3      1        0.2
       2       4      0        0.9
       2       5      0        0.8
       3       1      1        0.4
       3       2      1        0.5
       3       3      1        0.9
       3       4      1        0.8
       3       5      0        0.9
       3       6      0        0.8
       3       7      0        0.7

Solution

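  • Use table(status) as prob= in sample(). Because the values are still drawn independently with replacement, this gives proportions that are similar to those in status on a large scale, not exactly equal.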

    set.seed(42)
    datei[, random_status := sample(0:1, .N, replace=TRUE, prob=table(status)), keyby = district]
    
    colMeans(datei[, 3:4])
          #  status random_status 
          # 0.56339       0.56277 
    

    Data:

    (the example data, resampled with replacement to 1e5 rows)

    datei <- structure(list(district = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 
    3, 3, 3, 3, 3), village = c(1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 
    3, 4, 5, 6, 7), status = c(1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 
    1, 0, 0, 0)), row.names = c(NA, -16L), class = c("data.table", 
    "data.frame"))
    
    set.seed(42)
    datei <- datei[sample.int(nrow(datei), 1e5, replace=TRUE), ]
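
If the counts of 1s and 0s have to match status exactly within each district, rather than only approximately as with prob=, one option is to permute status within each district instead of drawing with replacement. The following is a minimal sketch of that idea on the small example data from the question, including the 1000 repetitions. The helper shuffle() and the column names exp_status and district_share are my own names for illustration, not part of data.table.

```r
library(data.table)

# Example data from the question (16 villages in 3 districts)
datei <- data.table(
  district = c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3),
  village  = c(1,2,3,4,1,2,3,4,5,1,2,3,4,5,6,7),
  status   = c(1,0,1,0,1,1,1,0,0,1,1,1,1,0,0,0)
)

# shuffle() is a hypothetical helper: it permutes by index, so a group with a
# single row is returned unchanged instead of being expanded by sample()
shuffle <- function(x) x[sample.int(length(x))]

set.seed(42)

# One randomization: a permutation of status within each district, so the
# number of 1s and 0s per district is exactly the same as in status
datei[, random_status := shuffle(status), by = district]

# Repeat the randomization n_rep times per district and average row-wise.
# matrix(..., nrow = .N) keeps the shape right even for one-row districts.
n_rep <- 1000
datei[, exp_status := rowMeans(matrix(replicate(n_rep, shuffle(status)), nrow = .N)),
      by = district]

# For comparison: the share of 1s in each district
datei[, district_share := mean(status), by = district]

print(datei)
```

Under this scheme every row in a district is equally likely to receive a 1, so exp_status should settle near the district's share of 1s (2/4, 3/5 and 4/7 here) as the number of repetitions grows. If the values in the question's exp_status column are meant to differ between rows of the same district, a different randomization scheme would be needed.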