Search code examples
replaceforeachstata

Stata: Replace negative values with missing and drop var if replaced values missing more than 98% of the time


I have a dataset which has negative values indicating missing (mostly in variables of byte type). I would like for all variables to replace all negative values with missing(.) values and subsequently drop all variables for which these values are missing in more than 98% of the observations. By Googling I came up with the following, but I get an error.

type mismatch
r(109);

glo p= 0.98
foreach var of varlist* {
    qui replace (`var') = . if (`var') < 0
    qui count if missing(`var')      
    if (r(N)/_N) >= $p drop `var'    
}

I was thinking about using something like

glo p= 0.98
foreach var of varlist* {
    if (`var') == type(byte)
    qui replace (`var') = . if (`var') < 0
    qui count if missing(`var')      
    if (r(N)/_N) >= $p drop `var'    
}

but then I get:

unknown function type()
r(133);

Any suggestions?

I would already be happy with a solution that first replaces all negative values and then removes variables where the majority is missing.


Solution

  • Negative values make no sense for string variables, which is presumably how the error arises, so you could have something like

    ds, has(type numeric) 
    local N = 0.98 * _N 
    quietly foreach var in `r(varlist)' {
        replace `var' = . if  `var' < 0
        count if missing(`var')      
        if r(N) >= `N' drop `var'    
    }
    

    Notes:

    1. You can use confirm to check whether a variable is numeric or string, but filtering out numeric variables in one step is suggested here.

    2. You refer to a global p but never define it.

    3. You can't set any constant or variable in Stata with statements of the form newname = value. (Mata, yes!)

    4. An if statement without an action makes no sense. Stata won't hold for later statements unless you use braces.

    Generally, it is best to give reproducible examples.