survey package in R: How to set fpc argument (finite population correction)


I have sampled some data from a sampling frame using a probability proportional to size (PPS) plan. The sample is stratified on the combination of two variables, gender and pre, which gives 6 strata with the following proportions:

      pre
gender  High   Low Medium
     F 0.155 0.155  0.195
     M 0.155 0.155  0.185

Now I want to specify the design of my sampled data using svydesign from the R package "survey". How should I define the fpc (finite population correction) argument?

The documentation says:

For PPS sampling without replacement it is necessary to specify the probabilities for each stage of sampling using the fpc argument, and an overall weight argument should not be given.

library(survey)

out <- read.csv('https://raw.githubusercontent.com/rnorouzian/d/master/out.csv')

dstrat <- svydesign(id=~1,strata=~gender+pre, data=out, pps = "brewer", fpc = ????)

Solution

  • To add a proportion column, group by 'gender' and 'pre', compute each group's share as its count divided by the sum of all counts, then right_join the result back onto the original data.

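    library(dplyr)

    out1 <- out %>%
      group_by(gender, pre) %>%
      summarise(n = n(), .groups = 'drop') %>%   # rows per gender x pre cell
      mutate(fpc = n / sum(n)) %>%               # cell share of the whole sample
      right_join(out, by = c("gender", "pre"))   # attach n and fpc to every row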
    

    Or using adorn_percentages from janitor

    library(dplyr)
    library(janitor)
    library(tidyr)

    out1 <- out %>%
      tabyl(gender, pre) %>%
      adorn_percentages(denominator = "all") %>%   # each cell divided by the grand total
      pivot_longer(cols = -gender, names_to = 'pre',
                   values_to = 'fpc') %>%
      right_join(out, by = c("gender", "pre"))
    

    If we need a reusable function:

    f1 <- function(dat, grp_cols) {
      dat %>%
        group_by(across(all_of(grp_cols))) %>%
        summarise(n = n(), .groups = 'drop') %>%
        mutate(fpc = n / sum(n)) %>%
        right_join(dat)   # joins on the shared grouping columns
    }
    
    
    
    f1(out, c("gender", "pre"))
    #Joining, by = c("gender", "pre")
    # A tibble: 200 x 11
    #   gender pre       n   fpc   no. fake.name sector   pretest state email            phone      
    #   <chr>  <chr> <int> <dbl> <int> <chr>     <chr>      <int> <chr> <chr>            <chr>      
    # 1 F      High     31 0.155     1 Pont      Private     1352 NY    Pont@...com      xxx-xx-6216
    # 2 F      High     31 0.155     2 Street    NGO         1438 CA    Street@...com    xxx-xx-6405
    # 3 F      High     31 0.155     3 Galvan    Private     1389 NY    Galvan@...com    xxx-xx-9195
    # 4 F      High     31 0.155     4 Gorman    NGO         1375 CA    Gorman@...com    xxx-xx-1845
    # 5 F      High     31 0.155     5 Jacinto   Private     1386 CA    Jacinto@...com   xxx-xx-6237
    # 6 F      High     31 0.155     6 Shah      Public      1384 CA    Shah@...com      xxx-xx-5723
    # 7 F      High     31 0.155     7 Randon    Private     1360 TX    Randon@...com    xxx-xx-7542
    # 8 F      High     31 0.155     8 Koucherik NGO         1439 NY    Koucherik@...com xxx-xx-9137
    # 9 F      High     31 0.155     9 Waters    Industry    1414 TX    Waters@...com    xxx-xx-7560
    #10 F      High     31 0.155    10 David     Industry    1396 CA    David@...com     xxx-xx-6498
    # … with 190 more rows
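
For a quick check without any extra packages, the same per-row share can be sketched in base R. This is only an illustration on made-up data: the `toy` data frame and its values are hypothetical stand-ins for `out`, and `ave()` replaces the group_by/summarise/join steps used above.

```r
# Hypothetical toy data standing in for `out` (not the real file)
toy <- data.frame(
  gender = c("F", "F", "F", "M", "M", "M", "M", "M"),
  pre    = c("High", "High", "Low", "High", "Low", "Low", "Low", "Low")
)

# Number of sampled rows in each gender x pre cell, repeated on every row
toy$n <- ave(seq_len(nrow(toy)), toy$gender, toy$pre, FUN = length)

# Share of the whole sample that falls in the row's cell (same as n / sum(n) above)
toy$fpc <- toy$n / nrow(toy)

# One row per cell; the shares across cells add up to 1
cells <- unique(toy)
print(cells)
```

The resulting column could then be referenced by formula, for example `fpc = ~fpc` in the svydesign call, assuming these sample shares really are the sampling probabilities the PPS design calls for.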