Search code examples
rsparse-matrixsurveydivide-by-zero

Handling zero cell counts in the "survey" package


[Edited to include smaller reproducible example]

The Goal

I'm learning the R survey package to get cell proportions (specifically, row percentages) and confidence intervals for a two-way, 58 x 4 contingency table, with variance estimation taking into account the complex sample design. The 58 California counties are on the rows and a "sexual orientation" variable with four categories (straight, gay/lesbian, bisexual, and other) is on the columns.

The Problem

The problem is that several small counties have zero cell counts. When I try to get the table using the svyby method, I get the error "data length is not a sub-multiple or multiple of the number of rows."

Here's a reproducible example:

# Set the number of trials and the probabilities for each outcome
set.seed(16)
probs <- c(.94, .03, .02, .01)
n <- 10000

#Tables with 0s

#Create df
sogi <- data.frame(t(rmultinom(n, 1, prob = probs)))
names(sogi) <- c("straight","gay","bisexual","other")
df_sogi <- sogi %>% mutate(sexual_orientation = 
  as.factor(case_when(
    straight %in% 1 ~ 0,
    gay %in% 1 ~ 1,
    bisexual %in% 1 ~ 2,
    other %in% 1 ~ 3))) %>%
  mutate(pw = runif(n, 0.2, 3.5)) %>%
  mutate(county = as.factor(rep(LETTERS[1:10], each = n/length(LETTERS[1:10])))) %>%
  mutate(ind = row_number()) %>%
  select(ind, county, sexual_orientation, pw) 

#Check raw counts
table(df_sogi$county, df_sogi$sexual_orientation)

#Make survey design and survey design replicate objects
df_svy_z1 <- svydesign(id=~ind, data = df_sogi, weights = ~pw)

#Estimate cell means
svyby(formula = ~county, by = ~sexual_orientation, design = df_svy_z1, FUN = svymean)


#Modify table to put in zeroes
df_sogi$sexual_orientation <- ifelse(df_sogi$county %in% c("G","I") & df_sogi$sexual_orientation==3, 
       df_sogi$sexual_orientation == 0, df_sogi$sexual_orientation)

Check raw counts
table(df_sogi$county, df_sogi$sexual_orientation)

df_svy_z2 <- svydesign(id=~ind, data = df_sogi, weights = ~pw)
svyby(~factor(county), ~factor(sexual_orientation), df_svy_z2, svymean)


Solution

  • Your problem is that county is not a factor. It's getting converted to a factor separately for each subset as the call is made to svymean, but since that call is just for a subset, the factor levels are just the levels present in the subset.

    The whole point of factors is that they know what their possible levels are, and you lose this whole point when do the conversion this way. You either have to convert in advance

    df_svy_z2<-update(df_svy_z1, countyf=factor(county))
    svyby(~factor(countyf), ~factor(sexual_orientation), df_svy_z2, svymean)
    

    or list the possible levels explicitly in the call to factor()

    svyby(~factor(county,levels=LETTERS[1:10]), ~factor(sexual_orientation), df_svy_z2, svymean)
    

    This used not to be a problem very often, because the conversion to factor was done automatically when data were read in; very few people had character strings in their datasets. Since the change to stringsAsFactors=FALSE as the R default, it's increasingly common for people to have variables that should be factors but that are just represented as character strings.

    The problem is sufficiently common that there's a note on the help page for svyby

    The function works by making a lot of calls of the form FUN(formula, subset(design, by==i)), where formula is re-evaluated in each subset, so it is unwise to use data-dependent terms in formula. In particular, svyby(~factor(a), ~b, design=d, svymean), will create factor variables whose levels are only those values of a present in each subset. If a is a character variable then svyby(~a, ~b, design=d, svymean) creates factor variables implicitly and so has the same problem. Either use update.survey.design to add variables to the design object instead or specify the levels explicitly in the call to factor. The stringsAsFactors=TRUE option converts all character variables to factors, which can be slow, set it to FALSE if you have predefined factors where necessary.

    As the note says, a third option is

    svyby(~factor(county), ~factor(sexual_orientation), df_svy_z2, svymean, stringsAsFactors=TRUE)
    

    but that's not best practice because it converts all strings to factors and because it assumes you want the set of factor levels present in the data passed to svyby and in their default order.