Simple way to do a weighted hot deck imputation in Stata?


I'd like to do a simple weighted hot deck imputation in Stata. In SAS the equivalent command would be the following (and note that this is a newer SAS feature, beginning with SAS/STAT 14.1 in 2015 or so):

proc surveyimpute method=hotdeck(selection=weighted); 

For clarity then, the basic requirements are:

  1. Imputations must be row-based, i.e. simultaneous. If row 1 donates x to row 3, then it must also donate y.

  2. Imputations must account for weights. A donor with weight=2 should be twice as likely to be selected as a donor with weight=1.

I'm assuming the missing data is rectangular. In other words, if the set of potentially missing variables consists of x and y then either both are missing or neither is missing. Here's some code to generate sample data.

// start from an empty dataset so that -set obs- cannot fail
clear

global miss_vars "wealth income"
global weight    "weight"

set obs 6
gen id = _n
gen type = id > 3
gen income = 5000 * _n
// runiform() is the current name of the older uniform() function
gen wealth = income * 4 + 500 * runiform()
gen weight = 1
replace weight = 4 if mod(id-1,3) == 0

// set income & wealth to missing in every third row
gen impute = mod(_n,3) == 0
foreach v in $miss_vars {
    replace `v' = . if impute == 1
}

Data looks like this:

            id       type     income     wealth     weight     impute
  1.         1          0       5000   20188.03          4          0
  2.         2          0      10000   40288.81          1          0
  3.         3          0          .          .          1          1
  4.         4          1      20000   80350.85          4          0
  5.         5          1      25000   100378.8          1          0
  6.         6          1          .          .          1          1

So in other words, for each row with missing values we need to randomly select (with weighting) a donor observation of the same type, and use that donor to fill in both the income and wealth values. In practical use, generating the type variable is of course its own problem, but I'm keeping that very simple here to focus on the main issue.

For example, row 3 might look like either of the following after the hot deck, because it fills both income and wealth from row 1, or both from row 2 (it would never take income from row 1 and wealth from row 2):

  3.         3          0       5000   20188.03          1          1
  3.         3          0      10000   40288.81          1          1

Also, since row 1 has weight=4 and row 2 has weight=1, row 1 should be the donor 80% of the time and row 2 should be the donor 20% of the time.
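Stated a bit more generally, the requirement is that each donor's selection probability equals its share of the total donor weight within its cell. In the sketch below, the symbols w_i (the weight of donor i) and D_c (the set of donors in cell c, here the rows with the same type and no missing values) are notation introduced for illustration only:

```latex
P(\text{donor} = i \mid \text{cell } c) = \frac{w_i}{\sum_{j \in D_c} w_j},
\qquad
P(\text{row 1}) = \frac{4}{4 + 1} = 0.8,
\quad
P(\text{row 2}) = \frac{1}{4 + 1} = 0.2
```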


Solution

  • It appears there was no built-in way to do this in Stata, and no community-contributed command did it either. There were community-contributed commands that performed hot deck imputation (specifically hotdeck, whotdeck, and hotdeckvar), but none of them handled sample weights. The whotdeck command superficially appeared to handle weights, but these are internally estimated importance weights rather than sample weights.

    Hence I wrote a program myself and uploaded it to GitHub. It is called wtd_hotdeck. Please see its GitHub page for more information and any subsequent updates.
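For readers who only want to see the underlying idea, here is a minimal sketch of a weighted, row-based hot deck within cells. It is not the wtd_hotdeck source, and it makes several assumptions: the helper names (cumw, lo, hi, u, donor_income, donor_wealth, and the tempfiles donors and picks) are hypothetical, the seed is arbitrary, every cell is assumed to contain at least one donor, and the random noise in wealth is dropped so that it is easy to check that both values came from the same donor.

```stata
* Sketch only: recreate a noise-free version of the sample data
clear
set seed 12345
set obs 6
gen id     = _n
gen type   = id > 3
gen income = 5000 * id
gen wealth = 4 * income
gen weight = cond(mod(id - 1, 3) == 0, 4, 1)
gen impute = mod(id, 3) == 0
replace income = . if impute
replace wealth = . if impute

* Step 1: give each donor a slice [lo, hi) of the unit interval whose
* width is its share of the total donor weight in its cell
preserve
keep if impute == 0
bysort type (id): gen double cumw = sum(weight)
by type: gen double hi = cumw / cumw[_N]
by type: gen double lo = cond(_n == 1, 0, hi[_n - 1])
rename (income wealth) (donor_income donor_wealth)
keep type lo hi donor_income donor_wealth
tempfile donors
save `donors'
restore

* Step 2: one uniform draw per recipient; pair it with every donor in
* the same cell and keep the donor whose slice contains the draw
preserve
keep if impute == 1
gen double u = runiform()
keep id type u
joinby type using `donors'
keep if u >= lo & u < hi
keep id donor_income donor_wealth
tempfile picks
save `picks'
restore

* Step 3: copy both variables from the single chosen donor row
merge 1:1 id using `picks', nogenerate
replace income = donor_income if impute == 1
replace wealth = donor_wealth if impute == 1
drop donor_income donor_wealth
```

Because a single draw selects one donor row and both variables are copied from that row, income and wealth always come from the same donor. The slice widths are proportional to the weights, so in the type 0 cell row 1 covers [0, 0.8) and row 2 covers [0.8, 1). Pairing every recipient with every donor via joinby is simple but scales with the product of recipients and donors per cell, so for large cells a different lookup would be preferable.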