
Using R or tidyverse to adjust proportion in my dataframe to match a known parameter (using prop.table or table or anything else)


I'm working with psychological assessment data that was collected through non-random sampling. My dataframe contains "sex" ("male" and "female") and "level of education" ("elementary", "high school", "college"). However, the empirical distribution of these variables differs from the true population distribution.

I know that the true proportions for sex are 0.7 female and 0.3 male. I also know that the true proportions for schooling are 0.5 elementary, 0.3 high school, and 0.2 college.

I would like code that could "cut" (adjust?) my dataframe to match these characteristics. I know my final dataframe will have fewer participants than my current one. I'm wondering whether a for-loop solution is doable in this case.

Data:

df2 = data.frame(
  sex = rep(c("m","f"),135),
  schooling = c("elementary","highschool","college")
)

prop.table(table(df2$sex))
prop.table(table(df2$schooling))

Solution

  • You could weight your observations by your desired proportions, then use dplyr::slice_sample():

    set.seed(13)
    library(dplyr)
    
    prop_sex <- c(f = 0.7, m = 0.3)
    prop_school <- c(elementary = 0.5, highschool = 0.3, college = 0.2)
    
    df3 <- df2 %>%
      mutate(wt = prop_sex[sex] * prop_school[schooling]) %>%
      slice_sample(prop = 0.6, weight_by = wt)
    
    prop.table(table(df3$sex))
    #        f        m 
    # 0.617284 0.382716 
    
    prop.table(table(df3$schooling))
    #   college elementary highschool 
    # 0.2716049  0.4259259  0.3024691 
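
Using the target proportions directly as weights is reasonable here because the example data are balanced (each sex and each schooling level is equally frequent). For a sample that is not balanced, one common adjustment is to divide each target proportion by the observed proportion, so that over-represented groups are down-weighted. The sketch below assumes a hypothetical unbalanced data frame, `df_unbal`, which is not part of the question, and uses base R only. Because the draw is without replacement, the match is still approximate and drifts further from the target as the sampling fraction grows.

```r
# Hypothetical unbalanced sample (an assumption for illustration): 60% male
df_unbal <- data.frame(sex = rep(c("m", "f"), times = c(600, 400)))

prop_sex <- c(f = 0.7, m = 0.3)               # target from the question
obs_sex  <- prop.table(table(df_unbal$sex))   # observed shares: f = 0.4, m = 0.6

# Weight = target share / observed share, looked up for every row
ratio <- prop_sex / as.vector(obs_sex[names(prop_sex)])
df_unbal$wt <- unname(ratio[as.character(df_unbal$sex)])

# Share of the total weight held by each sex; this now equals the target
wt_by_sex <- vapply(split(df_unbal$wt, df_unbal$sex), sum, numeric(1))
wt_share  <- wt_by_sex / sum(wt_by_sex)
print(wt_share)

# Weighted draw of row indices with a small sampling fraction (100 of 1000 rows)
set.seed(13)
keep <- sample(nrow(df_unbal), size = 100, prob = df_unbal$wt)
print(prop.table(table(df_unbal$sex[keep])))
```

The same ratio can be computed for schooling and multiplied in, as in the `wt` column above.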
    

    Depending on the level of precision you want, you could iterate until the proportions fall within a certain tolerance of your targets.

    df2 <- mutate(df2, wt = prop_sex[sex] * prop_school[schooling])
    
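    max_iter <- 1000  # give up after this many resamples
    tol <- 0.05       # largest acceptable deviation from any target proportion
    for (i in seq_len(max_iter)) {
      # draw a fresh weighted subsample and tabulate its proportions
      df3 <- slice_sample(df2, prop = 0.6, weight_by = wt)
      obs_sex <- prop.table(table(df3$sex))
      obs_school <- prop.table(table(df3$schooling))
      # stop as soon as both variables are within tolerance of their targets
      if (
        max(abs(obs_sex - prop_sex[names(obs_sex)])) < tol &&
        max(abs(obs_school - prop_school[names(obs_school)])) < tol
      ) break
      if (i == max_iter) warning("max iterations reached")
    }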
    
    prop.table(table(df3$sex))
    #         f         m 
    # 0.6728395 0.3271605 
    
    prop.table(table(df3$schooling))
    #   college elementary highschool 
    # 0.2469136  0.4567901  0.2962963 
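
If resampling until the proportions happen to land within tolerance feels too indirect, the number of rows to keep in each sex-by-schooling cell can be fixed in advance and drawn exactly. This is only a sketch in base R, and it rests on a strong assumption: only the two marginal distributions are known, so the cell targets below treat sex and schooling as independent in the population. The names `target`, `avail`, `n_max`, and `quota` are illustrative.

```r
prop_sex    <- c(f = 0.7, m = 0.3)
prop_school <- c(elementary = 0.5, highschool = 0.3, college = 0.2)

df2 <- data.frame(
  sex = rep(c("m", "f"), 135),
  schooling = c("elementary", "highschool", "college")
)

# Target share of each cell, ASSUMING sex and schooling are independent
target <- outer(prop_sex, prop_school)

# Rows available per cell, arranged in the same order as `target`
avail <- unclass(table(df2$sex, df2$schooling))[names(prop_sex), names(prop_school)]

# Largest subsample size for which no cell needs more rows than it has
n_max <- floor(min(avail / target))

# Rows to keep per cell (rounded, and never more than are available)
quota <- pmin(round(n_max * target), avail)

set.seed(13)
keep <- integer(0)
for (s in names(prop_sex)) {
  for (e in names(prop_school)) {
    rows <- which(df2$sex == s & df2$schooling == e)
    # sample positions within `rows` to avoid sample()'s length-one quirk
    keep <- c(keep, rows[sample.int(length(rows), quota[s, e])])
  }
}
df3 <- df2[sort(keep), ]

print(prop.table(table(df3$sex)))
print(prop.table(table(df3$schooling)))
```

With this approach the only deviation from the targets comes from rounding the cell counts to whole rows. The random draw affects which rows are kept, not how many.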
    

    You may also want to look into the survey or srvyr packages for working with weighted data.
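
To illustrate what working with weighted data means here, the sketch below keeps every participant and computes a weight per row by raking (iterative proportional fitting): the weights are rescaled to match the sex margin, then the schooling margin, and so on until both agree with the targets. It is written by hand in base R so that it has no dependencies, and it is not how either package must be used; the survey package provides its own `rake()` function for this purpose. The helper `wshare()` and the fixed 50 iterations are assumptions made for the example.

```r
prop_sex    <- c(f = 0.7, m = 0.3)
prop_school <- c(elementary = 0.5, highschool = 0.3, college = 0.2)

df2 <- data.frame(
  sex = rep(c("m", "f"), 135),
  schooling = c("elementary", "highschool", "college")
)

# Hypothetical helper: share of the total weight held by each group
wshare <- function(w, g) {
  s <- vapply(split(w, g), sum, numeric(1))
  s / sum(s)
}

sex    <- as.character(df2$sex)
school <- as.character(df2$schooling)
w <- rep(1, nrow(df2))  # start from equal weights

for (iter in seq_len(50)) {
  # rescale so the weighted sex shares match the target
  w <- w * unname(prop_sex[sex] / wshare(w, sex)[sex])
  # rescale so the weighted schooling shares match the target
  w <- w * unname(prop_school[school] / wshare(w, school)[school])
}

df2$wt <- w * length(w) / sum(w)  # normalise to a mean weight of 1

print(wshare(df2$wt, sex))
print(wshare(df2$wt, school))

# Kish's approximation of the effective sample size under these weights
n_eff <- sum(df2$wt)^2 / sum(df2$wt^2)
print(n_eff)
```

The weights can then be passed to weighted estimators (for example `weighted.mean(score, wt)` for a hypothetical `score` column) instead of discarding rows. The effective sample size gives a rough sense of how much precision the reweighting costs.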