Search code examples
rdplyrsample

Using weights for sampling with replacement with the sample_n() function


All,

I have a dplyr sample_n() question. I'm trying to sample with replacement while using the weight option and I seem to be hitting a snag. Namely, sampling with replacement is consistently oversampling a group. It's not a problem I'm getting when sampling without replacement, but I'd really like to do sampling with replacement if I could.

Here's a minimal working example that uses the familiar apistrat and apipop data from the survey package. Survey researchers in R know these data well. In the population data (apipop), elementary schools (stype == E) account for about 71.4% of all schools. Middle schools (stype == M) are about 12.2% of all schools and high schools (stype == H) are about 16.4% of all schools. The apistrat has a deliberate imbalance in which the elementary schools are 50% of the data while middle schools and high schools are each the remaining 25% of the 200-row sample.

What I'd like to do is sample the apistrat data, with replacement, using the sample_n() function. However, I seem to be consistently oversampling the elementary schools and undersampling the middle schools and high schools. Here's a minimal working example in R code. Please forgive my cornball looping code. I know I need to get better at purrr but I'm not quite there yet. :P

library(survey)
library(tidyverse)

apistrat %>% tbl_df() -> strat
apipop %>% tbl_df() -> pop

pop %>%
  group_by(stype) %>% 
  summarize(prop = n()/6194) -> Census

Census
# p(E) = ~.714
# p(H) = ~.122
# p(M) = ~.164

strat %>%
  left_join(., Census) -> strat

# Sampling with replacement seems to consistently oversample E and undersample H and M.
with_replace <- tibble()
set.seed(8675309) # Jenny, I got your number...

for (i in 1:1000) {
strat %>%
    sample_n(100, replace=T, weight = prop) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
with_replace <- bind_rows(with_replace, hold_this)

}

# group_by means with 95% intervals
with_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))

# ^ consistently oversampled E.
# meanprop of E = ~.835.
# meanprop of H = ~.070 and meanprop of M = ~.095
# 95% intervals don't include true probability for either E, H, or M.

# Sampling without replacement doesn't seem to have this same kind of sampling problem.
wo_replace <- tibble()
set.seed(8675309)  # Jenny, I got your number...

for (i in 1:1000) {
  strat %>%
    sample_n(100, replace=F, weight = prop) %>%
    group_by(stype) %>%
    summarize(i = i,
              n = n(),
              prop = n/100) -> hold_this
  wo_replace <- bind_rows(wo_replace, hold_this)

}

# group_by means with 95% intervals
wo_replace %>%
  group_by(stype) %>%
  summarize(meanprop = mean(prop),
            lwr = quantile(prop, .025),
            upr = quantile(prop, .975))


# ^ better in orbit of the true probability
# meanprob of E = ~.757. meanprob of H = ~.106. meanprob of M = ~.137
# 95% intervals include true probability as well.

I'm not sure if this is a dplyr (v. 0.8.3) problem. The 95% intervals for sampling with replacement don't include the true probability and each sample (were you to peak at them) are consistently in that mid-.80s range for sampling the elementary schools. Only three of the 1,000 samples (with replacement) had a composition where elementary schools were fewer than 72% of the 100-row sample. It's that consistent. I'm curious if anyone here as any insight to what's happening, or possibly what I might be doing wrong and if I'm misinterpreting the functionality of sample_n().

Thanks in advance.


Solution

  • The sample_n() function in dplyr is a wapper for base::sample.int(). Looking at base::sample.int()--and the actual function is implemented in C. And we can see that the problem comes from the source:

    rows <- sample(nrow(strat), size = 100, replace=F, prob = strat$prop)
    strat[rows, ] %>% count(stype)
    # A tibble: 3 x 2
      stype     n
      <fct> <int>
    1 E        74
    2 H        14
    3 M        12
    
    rows <- sample(nrow(strat), size = 100, replace=T, prob = strat$prop)
    strat[rows, ] %>% count(stype)
    # A tibble: 3 x 2
      stype     n
      <fct> <int>
    1 E        85
    2 H         8
    3 M         7
    

    I'm honestly not totally sure why this is the case, but if you make the probabilities sum to 1 and make them uniform within group, then it gives the sample sizes expected:

    library(tidyverse)
    library(survey)
    
    data(api)
    
    apistrat %>% tbl_df() -> strat
    apipop %>% tbl_df() -> pop
    
    pop %>%
      group_by(stype) %>% 
      summarize(prop = n()/6194) -> Census
    
    
    strat %>%
      left_join(., Census) -> strat
    #> Joining, by = "stype"
    
    set.seed(8675309) # Jenny, I got your number...
    with_replace <- tibble()
    
    for (i in 1:1000) {
      strat %>%
        group_by(stype) %>%
        mutate(per_prob = sample(prop/n())) %>% 
        ungroup() %>% 
        sample_n(100, replace=T, weight = per_prob) %>%
        group_by(stype) %>%
        summarize(i = i,
                  n = n(),
                  prop = n/100) -> hold_this
      with_replace <- bind_rows(with_replace, hold_this)
    
    }
    
    with_replace %>%
      group_by(stype) %>%
      summarize(meanprop = mean(prop),
                lwr = quantile(prop, .025),
                upr = quantile(prop, .975))
    #> # A tibble: 3 x 4
    #>   stype meanprop   lwr   upr
    #>   <fct>    <dbl> <dbl> <dbl>
    #> 1 E        0.713  0.63  0.79
    #> 2 H        0.123  0.06  0.19
    #> 3 M        0.164  0.09  0.24
    

    Created on 2020-04-17 by the reprex package (v0.3.0)

    I'm guessing that this has something to do with the entities within the vector of p not being diminished by replace = TRUE, but really I have no idea what's going on under the hood. Someone with C knowledge should take a look!