
R: resample survey units at different proportions


I need to resample my survey. I have survey data from a country with 3 counties (1001, 1002, and 1003). The study surveyed 10 households in each county. I need to randomly pull the household id numbers for a smaller sample within each county.

The catch is that the % of household ids I need to pull varies from county to county. In other words, I need to extract a random sample of 25% of household ids in county 1001, 50% in county 1002, and 75% from county 1003.

Below is some mock data:

set.seed(100)
mock.data <- data.frame(county= rep(c(1001:1003), each = 10),
                   household.id= sample(1000:4000, 30, replace=F))

Here is the proportion of household ids I need to pull from each county.

prop_to_sample <- data.frame(county=c(1001,1002,1003),
                             prop.households=c(0.25,0.50,0.75))

Below is the for loop I am trying to write to extract household ids from mock.data using the household proportions in prop_to_sample. The values marked with ** are placeholders for the current county and its proportion.

household.ids.saved <- NULL
counties.run <- unique(mock.data$county)
for (i in counties.run) {
ids <- mock.data %>%
  filter(county== **county**) %>%
  slice_sample(prop = **prop.households**) %>%
  ungroup() %>%
  pull(household.id)
household.ids.saved <- c(household.ids.saved, ids)
}

Thank you


Solution

  • You can do this in a number of ways.

    Here is a dplyr-based approach that uses group_map and slice_sample:

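    library(dplyr)

    # x: the rows of one county; y: a one-row tibble holding that county's key
    f <- function(x, y) {
      # look up the sampling proportion for this county
      p <- with(prop_to_sample, prop.households[county == y$county])
      slice_sample(x, prop = p)
    }
    bind_rows(group_map(group_by(mock.data, county), f, .keep = TRUE))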
    

    Output:

    # A tibble: 14 x 2
       county household.id
        <int>        <int>
     1   1001         1502
     2   1001         2121
     3   1002         3346
     4   1002         2330
     5   1002         3371
     6   1002         3513
     7   1002         2527
     8   1003         3996
     9   1003         1346
    10   1003         2807
    11   1003         3675
    12   1003         1509
    13   1003         1970
    14   1003         1604
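
    The group sizes in that output (2, 5 and 7 rows) follow from slice_sample() rounding down when the proportion does not give a whole number of rows: 25% of 10 households is 2.5, which becomes 2, and 75% is 7.5, which becomes 7. A minimal sketch to check this, rebuilding the mock data from the question (the name sampled is only illustrative):

    ```r
    library(dplyr)

    # Mock data and proportions as defined in the question
    set.seed(100)
    mock.data <- data.frame(county = rep(1001:1003, each = 10),
                            household.id = sample(1000:4000, 30, replace = FALSE))
    prop_to_sample <- data.frame(county = c(1001, 1002, 1003),
                                 prop.households = c(0.25, 0.50, 0.75))

    # Same helper as in the answer: sample one county at its own proportion
    f <- function(x, y) {
      p <- with(prop_to_sample, prop.households[county == y$county])
      slice_sample(x, prop = p)
    }
    sampled <- bind_rows(group_map(group_by(mock.data, county), f, .keep = TRUE))

    # Rows drawn per county: 2, 5 and 7 (10 * 0.25 and 10 * 0.75 rounded down)
    table(sampled$county)
    ```

    If you would rather round up, one option is to compute the count yourself, for example with ceiling(), and pass it to slice_sample() through n instead of prop.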
    

    Here is a possible approach using data.table:

    library(data.table)
    setDT(mock.data)
    setDT(prop_to_sample)
    
    mock.data[, sample(household.id, size = .N*(prop_to_sample[county==.BY, prop.households])), county]
    

    Output:

        county   V1
     1:   1001 2121
     2:   1001 3885
     3:   1002 2330
     4:   1002 3346
     5:   1002 3955
     6:   1002 2527
     7:   1002 2816
     8:   1003 2190
     9:   1003 1346
    10:   1003 1604
    11:   1003 1509
    12:   1003 1947
    13:   1003 1970
    14:   1003 2807
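
    Note that the call above passes a fractional size such as 2.5 to sample(), which R truncates to 2, so the group sizes match the dplyr version. Below is a sketch of a variant that makes the rounding explicit with floor() and samples row positions rather than the ids themselves, which also avoids the special behaviour of sample() when it is given a single number. The names mock.dt, prop.dt and sampled are only illustrative, and this is one possible way to write it rather than a required form:

    ```r
    library(data.table)

    # Mock data and proportions as in the question, built directly as data.tables
    set.seed(100)
    mock.dt <- data.table(county = rep(1001:1003, each = 10),
                          household.id = sample(1000:4000, 30, replace = FALSE))
    prop.dt <- data.table(county = c(1001, 1002, 1003),
                          prop.households = c(0.25, 0.50, 0.75))

    sampled <- mock.dt[, {
      # proportion for the county currently being processed
      p <- prop.dt$prop.households[prop.dt$county == .BY$county]
      # draw floor(.N * p) row positions within the group, then take their ids
      .(household.id = household.id[sample.int(.N, floor(.N * p))])
    }, by = county]

    # Rows drawn per county: 2, 5 and 7
    sampled[, .N, by = county]
    ```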
    

    Here is another approach, which uses apply() over the rows of prop_to_sample:

    rbindlist(
      apply(prop_to_sample,1,\(r) setDT(mock.data)[county==r[1]][sample(.N, .N*r[2])])
    )
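
    If you prefer to keep the for loop from the question, it can be completed along these lines. This is only a sketch: it assumes the placeholders stand for the loop variable i and for the matching row of prop_to_sample, and the helper name p is illustrative.

    ```r
    library(dplyr)

    # Mock data and proportions as defined in the question
    set.seed(100)
    mock.data <- data.frame(county = rep(1001:1003, each = 10),
                            household.id = sample(1000:4000, 30, replace = FALSE))
    prop_to_sample <- data.frame(county = c(1001, 1002, 1003),
                                 prop.households = c(0.25, 0.50, 0.75))

    household.ids.saved <- NULL
    for (i in unique(mock.data$county)) {
      # proportion to sample for county i
      p <- prop_to_sample$prop.households[prop_to_sample$county == i]
      ids <- mock.data %>%
        filter(county == i) %>%      # keep only the current county
        slice_sample(prop = p) %>%   # sample that county at its own proportion
        pull(household.id)
      household.ids.saved <- c(household.ids.saved, ids)
    }

    # 2 + 5 + 7 = 14 household ids in total
    length(household.ids.saved)
    ```

    Growing a vector inside a loop is fine for three counties; with many groups, the grouped approaches above are likely to be tidier.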