Search code examples
rstatisticssummary

Adjusting for sampling weights in table1 in R


I would like to write a custom rendering funtcion that shows summary statistics weighted by sampling weights using the table1 package in R.

The problem here is that in addition to the vector with the values that the summary statistics are based on, you somehow need to hand over the vector with the corresponding weights, and this is not straightforward to me.

I tried handing over the whole data set with an extra argument and using it to extract weights and variable values. The problem here is that displaying summary statistics for different groups doesn't work anymore, since the reference data set is always the one that I provide.

Here is what I did:


dat1 <- data.frame(disease = factor(rbinom(100, 1, 0.2),
                                    labels =c("No", "Yes")),
                   age = rnorm(100, 45, 15),
                   sample.weight = rexp(100, 1),
                   group = factor(rbinom(100, 1, 0.4),
                                  labels = c("Group 1", "Group 2"))) 



my.render.with.weights <- function(x, name, weightsvar, data2, ...) {
  
  
  values <- data2[[name]]
  selected_weights <- data2[[weightsvar]]
  
  # Continuous variables
  
  if(is.numeric(data2[[name]])){
    
    skwnss <- skewness(data2[[name]], na.rm = TRUE) 
    
    weightedmean <- round(weightedMean(values, weights = selected_weights, na.rm = TRUE), 2)
    weightedsd <- round(sqrt(wtd.var(values, weights = selected_weights, na.rm = TRUE)), 2)
    
    weightedmedian <- round(weightedMedian(values, weights = selected_weights, na.rm = TRUE), 2)
    weightedquantile <- round(weightedQuantile(values, weights = selected_weights, probs = c(0.25, 0.75), na.rm = TRUE), 2)
    
    out.skewed <- with(stats.apply.rounding(stats.default(values), digits=2), c("",
                                                                                `Median [Q1, Q3], weighted` = sprintf("%s [%s, %s]", weightedmedian,
                                                                                                                      weightedquantile[1], weightedquantile[2])))
    
    out.not.skewed <- with(stats.apply.rounding(stats.default(values), digits=2), c("",
                                                                                    "Mean (SD), weighted"=sprintf("%s (&plusmn; %s)", weightedmean, weightedsd)), render.missing = NULL)
    
    
    
    if(skwnss > 0.4){
      return(out.skewed)
    }else{
      return(out.not.skewed)
    }
    
  }else{
    
    # Categorical variables
    
    weighted.frequency <- aggregate(selected_weights ~ values, FUN = function(x) round(sum(x), 1))
    weighted.percent <- round(wpct(values, selected_weights), 2)
    
    out.categorical <-  c("", paste0(weighted.frequency$values, ", weighted: ",
                                     weighted.frequency$selected_weights,
                                     " (", weighted.percent, "%)"))
    
    names(out.categorical) <- c("", levels(values))
    
    
    
    if (length(levels(values)) == 2)
      return(out.categorical[3])
    else return(out.categorical)
    
  }
}



table1(~age + disease |group, data = dat1,
       data2 = dat1, weightsvar = "sample.weight", render = my.render.with.weights)

Here is what I got:

Output of my custom rendering function

I would like this output, but with the respective statistics for each group.


Solution

  • You can perform the calculations with the surveytable package, though the output will look differently than it looks with table1. (I'm assuming that what you're after is the results, not the specific look of the table.)

    Data:

    set.seed(123)
    
    dat1 <- data.frame(disease = factor(rbinom(100, 1, 0.2),
                                        labels =c("No", "Yes")),
                       age = rnorm(100, 45, 15),
                       sample.weight = rexp(100, 1),
                       group = factor(rbinom(100, 1, 0.4),
                                      labels = c("Group 1", "Group 2")))
    
    

    Create a survey object from this data frame. The survey object contains information about the weights, which is the issue that you were having:

    mysurvey = survey::svydesign(ids = ~ 1, weights = ~sample.weight, data = dat1)
    

    Start up surveytable:

    library(surveytable)
    set_survey(mysurvey, opts = "general")
    
                                        Survey info {mysurvey}                                    
    ┌───────────┬──────────────┬─────────────────────────────────────────────────────────────────┐
    │ Variables │ Observations │ Design                                                          │
    ├───────────┼──────────────┼─────────────────────────────────────────────────────────────────┤
    │         4 │          100 │ Independent Sampling design (with replacement)                  │
    │           │              │ mysurvey = survey::svydesign(ids = ~ 1, weights =               │
    │           │              │ ~sample.weight, data = dat1)                                    │
    └───────────┴──────────────┴─────────────────────────────────────────────────────────────────┘
    
    

    Overall summary stats:

    tab("disease", "age")
    
                             disease {mysurvey}                          
    ┌───────┬────┬────────┬────┬────┬─────┬─────────┬─────┬──────┬──────┐
    │ Level │  n │ Number │ SE │ LL │  UL │ Percent │  SE │   LL │   UL │
    ├───────┼────┼────────┼────┼────┼─────┼─────────┼─────┼──────┼──────┤
    │ No    │ 82 │     88 │  9 │ 71 │ 109 │    83.6 │ 4.7 │ 72.1 │ 91.7 │
    ├───────┼────┼────────┼────┼────┼─────┼─────────┼─────┼──────┼──────┤
    │ Yes   │ 18 │     17 │  5 │  9 │  33 │    16.4 │ 4.7 │  8.3 │ 27.9 │
    └───────┴────┴────────┴────┴────┴─────┴─────────┴─────┴──────┴──────┘
      N = 100.                                                           
    
             age {mysurvey}         
    ┌─────────┬──────┬──────┬──────┐
    │ % known │ Mean │  SEM │   SD │
    ├─────────┼──────┼──────┼──────┤
    │     100 │ 45.5 │ 1.94 │ 14.8 │
    └─────────┴──────┴──────┴──────┘
    

    Summary stats by group:

    tab_subset("disease", "group")
    tab_subset("age", "group")
    
                    disease (group = Group 1) {mysurvey}                
    ┌───────┬────┬────────┬────┬────┬────┬─────────┬─────┬──────┬──────┐
    │ Level │  n │ Number │ SE │ LL │ UL │ Percent │  SE │   LL │   UL │
    ├───────┼────┼────────┼────┼────┼────┼─────────┼─────┼──────┼──────┤
    │ No    │ 47 │     48 │  8 │ 34 │ 67 │    81.7 │ 6.3 │ 65.7 │ 92.4 │
    ├───────┼────┼────────┼────┼────┼────┼─────────┼─────┼──────┼──────┤
    │ Yes   │ 13 │     11 │  4 │  4 │ 26 │    18.3 │ 6.3 │  7.6 │ 34.3 │
    └───────┴────┴────────┴────┴────┴────┴─────────┴─────┴──────┴──────┘
      N = 60.                                                           
    
                    disease (group = Group 2) {mysurvey}                
    ┌───────┬────┬────────┬────┬────┬────┬─────────┬─────┬──────┬──────┐
    │ Level │  n │ Number │ SE │ LL │ UL │ Percent │  SE │   LL │   UL │
    ├───────┼────┼────────┼────┼────┼────┼─────────┼─────┼──────┼──────┤
    │ No    │ 35 │     40 │  8 │ 26 │ 61 │    85.9 │ 6.8 │ 66.6 │ 96.3 │
    ├───────┼────┼────────┼────┼────┼────┼─────────┼─────┼──────┼──────┤
    │ Yes   │  5 │      7 │  3 │  1 │ 36 │    14.1 │ 6.8 │  3.7 │ 33.4 │
    └───────┴────┴────────┴────┴────┴────┴─────────┴─────┴──────┴──────┘
      N = 40.                                                           
    
    age (for different levels of group) {mysurvey}
    ┌─────────┬─────────┬──────┬──────┬──────┐
    │ Level   │ % known │ Mean │  SEM │   SD │
    ├─────────┼─────────┼──────┼──────┼──────┤
    │ Group 1 │     100 │ 46.5 │ 2.9  │ 16.6 │
    ├─────────┼─────────┼──────┼──────┼──────┤
    │ Group 2 │     100 │ 44.1 │ 2.41 │ 12.3 │
    └─────────┴─────────┴──────┴──────┴──────┘