
SCF data issue from lodown package


I ran into a strange result while analyzing the SCF (Survey of Consumer Finances) with the lodown package. Something seems to be wrong with the data for one group: Black respondents under age 35 with some college education. The share and mean of net worth for this group are far too high.

I crossed three factors (race, age, and education) to see what share of total wealth each group holds.

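# packages used below
# (scf_MIcombine() is assumed to be defined already as part of the lodown / SCF setup)
library(survey)    # svrepdesign, svyby, svytotal
library(mitools)   # imputationList
library(dplyr)     # %>% and mutate
library(stringr)   # str_detect
library(tibble)    # rownames_to_column

# input data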
scf_imp <- readRDS( file.path( path.expand( "~" ) , "SCF" , "scf 2016.rds" ) )

scf_rw <- readRDS( file.path( path.expand( "~" ) , "SCF" , "scf 2016 rw.rds" ) )

scf_design <-
  svrepdesign(
    weights = ~wgt ,
    repweights = scf_rw[ , -1 ] ,
    data = imputationList( scf_imp ) ,
    scale = 1 ,
    rscales = rep( 1 / 998 , 999 ) ,
    mse = FALSE ,
    type = "other" ,
    combined.weights = TRUE
  )

# Variable Recoding
scf_design <- update(scf_design ,

                     racecl4 = factor(racecl4 ,
                                      labels = c("White" ,
                                                 "Black" ,
                                                 "Hispanic/Latino" ,
                                                 "Other" )),
                     edcl = factor(edcl ,
                                   labels = c("less than high school" ,
                                              "high school or GED" ,
                                              "some college" ,
                                              "college degree" )),
                     agecl = factor(agecl ,
                                    labels = c("less than 35" ,
                                               "35-44" ,
                                               "45-54" ,
                                               "55-64" ,
                                               "65-74" ,
                                               "75 or more"))
)
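# calculation: total net worth for every race x education x age cell
trible <- scf_MIcombine( with( scf_design ,
                               svyby( ~ networth , ~ interaction(racecl4 , edcl , agecl) , svytotal )
) )

# total net worth held by all Black households
sum_black <- trible[[1]][str_detect(names(trible[[1]]),"Black")] %>% sum()
# reshape the Black cells into education (rows) x age (columns);
# this positional reshape assumes all 24 cells are present, with education varying fastest
black <- trible[[1]][str_detect(names(trible[[1]]),"Black")] %>% matrix(nrow = 4)
# convert totals to shares of Black net worth
black <- as.data.frame(black/sum_black)
colnames(black) <- c("less than 35" , "35-44" , "45-54" , "55-64" ,"65-74" , "75 or more")
# add row totals, then column totals
black <- black %>% mutate(total = rowSums(black))
black <- rbind(black,total = colSums(black))
# format as percentages and label the rows
black <- sapply(black,scales::percent) %>% as.data.frame()
rownames(black) <- c("less than high school" , "high school or GED" , "some college" , "college degree", "total" )
black <- rownames_to_column(black,"share for black")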

I applied the same method to calculate the mean. The results show that Black respondents under age 35 with some college education have a very high share and mean, which should not be the case. Is something wrong with the data or with my method?




Solution

  • The Survey of Consumer Finances has only about 6,000 unweighted records, and you're breaking your results into almost 100 groups (4 race x 4 education x 6 age = 96 cells), so each cell holds only about N=60 records on average. Take a look at the unweighted counts to see how small the cells are:

    counts <- scf_MIcombine( with( scf_design ,
                                   svyby( ~ networth , ~ interaction(racecl4 , edcl , agecl) , unwtd.count )
    ) )
    

    This is not a hard-and-fast rule, but if a standard error is more than 30% of its statistic, that statistic might be unstable. Take a look at SE( trible ) / coef( trible ) > 0.3 and you'll see that almost all of your statistics are unstable.

    SCF is an amazing dataset, but the sample sizes probably aren't big enough to support such a fine-grained breakout.
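
The two checks above can be sketched in base R. This is only an illustration: `flag_unstable()` is a hypothetical helper (it is not part of survey, mitools, or lodown), the 6,000 figure is the rough sample size quoted above, and the toy estimates are made-up numbers rather than SCF results.

```r
# Back-of-the-envelope cell size: 4 race x 4 education x 6 age groups.
n_cells <- 4 * 4 * 6                        # number of cells in the breakout
approx_records <- 6000                      # rough unweighted sample size cited above
avg_per_cell <- approx_records / n_cells    # average unweighted records per cell

# Hypothetical helper: flag estimates whose standard error is more than
# `threshold` times the size of the estimate (the 30% rule of thumb).
flag_unstable <- function(est, se, threshold = 0.3) {
  abs(se / est) > threshold
}

# Made-up estimates and standard errors, purely for illustration.
toy_est <- c(a = 100, b = 50, c = 10)
toy_se  <- c(a = 10,  b = 20, c = 2)
toy_flags <- flag_unstable(toy_est, toy_se)

print(avg_per_cell)
print(toy_flags)
```

With the objects from the question, `flag_unstable( coef( trible ) , SE( trible ) )` should mark the same cells as the expression in the answer, assuming `coef()` and `SE()` work on the combined result as the answer implies.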