I have a dataset which has negative values indicating missing (mostly in variables of byte type). I would like for all variables to replace all negative values with missing(.) values and subsequently drop
all variables for which these values are missing in more than 98% of the observations. By Googling I came up with the following, but I get an error.
type mismatch
r(109);
glo p= 0.98
foreach var of varlist* {
qui replace (`var') = . if (`var') < 0
qui count if missing(`var')
if (r(N)/_N) >= $p drop `var'
}
I was thinking about using something like
glo p= 0.98
foreach var of varlist* {
if (`var') == type(byte)
qui replace (`var') = . if (`var') < 0
qui count if missing(`var')
if (r(N)/_N) >= $p drop `var'
}
but then I get:
unknown function type()
r(133);
Any suggestions?
I would already be happy with a solution that first replaces all negative values and then removes variables where the majority is missing.
Negative values make no sense for string variables, which is presumably how the error arises, so you could have something like
ds, has(type numeric)
local N = 0.98 * _N
quietly foreach var in `r(varlist)' {
replace `var' = . if `var' < 0
count if missing(`var')
if r(N) >= `N' drop `var'
}
Notes:
You can use confirm
to check whether a variable is numeric or string, but filtering out numeric variables in one step is suggested here.
You refer to a global p
but never define it.
You can't set any constant or variable in Stata with statements of the form newname =
value. (Mata, yes!)
if
statement without an action makes no sense. Stata won't hold for later statements unless you use braces. Generally, it is best to give reproducible examples.