Stata identify influential observations post svy regression


When using the Stata svy prefix, as in:

svy: logistic graduate age female i.math i.english

there are various follow-up checks that should be run, for example looking for outliers or high-leverage points. Without the svy prefix, the following commands work:

predict p
predict stdres, rstand
scatter stdres p, mlabel(snum) ylab(-4(2) 16) yline(0)

However, when the logistic regression is run with the svy prefix, the second predict command fails with the following error:

option rstandard not allowed after svy estimation

Great. What is allowed? How does someone look for outliers or high-leverage points after svy estimation?


Solution

  • @NickCox is right in his comment: not much work has been done on extending diagnostics to complex survey settings. One reason is that, technically speaking, survey inference is nonparametric: the object of inference is not some idealized relation between variables but the census regression, with all the "outliers" that the full population might have. There is no likelihood to be badly affected by outliers; there are just estimating equations, and the standard errors are "robust" anyway (i.e., they use the sandwich formula rather than the inverse Hessian).

    The work that is out there has mostly been done by Rick Valliant (R package svydiags: https://cran.r-project.org/web/packages/svydiags/, dissertation by his student Jianzhu Li: https://drum.lib.umd.edu/bitstream/handle/1903/7598/umi-umd-4863.pdf?sequence=1&isAllowed=y). Some follow-up papers were published out of that dissertation, but I could not find them right away.

    (This all feels more like discussion for CrossValidated/stats rather than SO/Stata.)
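
For a rough first look in Stata itself, one option is to compute residuals by hand from what predict still allows after svy estimation. The sketch below is only an illustration under several assumptions. It uses the nhanes2d example dataset and a model from the Stata manuals rather than the asker's data. The variable names p, pearson, and abspearson are made up here. The Pearson residuals lack the leverage adjustment that rstandard applies. They are descriptive aids for spotting poorly fitted observations, not design-based diagnostics of the kind svydiags aims to provide.

```stata
* Example data shipped with Stata; it is already svyset
webuse nhanes2d, clear

* Survey-weighted logistic regression (stand-in for the asker's model)
svy: logistic highbp height weight age female

* Predicted probabilities are still available after svy estimation
predict double p if e(sample), pr

* Pearson residual computed by hand: (y - p) / sqrt(p * (1 - p))
generate double pearson = (highbp - p) / sqrt(p * (1 - p)) if e(sample)

* Same style of plot as in the question, without the leverage adjustment
scatter pearson p, yline(0)

* List the ten largest residuals in absolute value
generate double abspearson = abs(pearson)
gsort -abspearson
list highbp p pearson in 1/10
```

Observations with large residuals are those the fitted census regression predicts poorly. Whether they also have a large effect on the estimates depends on their survey weights and covariate values, which this sketch does not examine.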