
Drop redundant value labels


I am working with part of a large dataset. Many variables carry value labels for values that do not occur in this part of the data. I would like to remove these redundant value labels from the dataset. I tried various approaches in Stata but did not succeed.

Apparently this does not work:

label drop X if X == 1 

Added text: So far I have come up with the following solutions, which are not ideal because I will need to repeat this exercise again and again in the future:

First (semi-manual):

fre var
di r(lab_valid)
label drop var
label define var 1 "Label 1" 2 "Label 2" 3 "label 3", modify

Second (X is a label code that needs to be kept; the problem is that I have several codes that need to be kept):

labellist var
local min = r(var_min)
local max = r(var_max)
forval i = `min'/`max' {
    if `i' != X {
        label define var `i' "", modify
    }
}

Solution

  • No "apparently" about it: that is not legal code, nor does it even make sense in principle. At best label drop drops named labels, but the name of the labels and the name of any variable they are attached to do not coincide unless you have set it up that way.
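
    To illustrate the point about names, here is a small sketch using the auto dataset shipped with Stata, where the variable foreign happens to carry a value label named origin. Nothing here is specific to your data; it only shows that label drop works on label names, and on whole sets of labels at that.

    ```stata
    sysuse auto, clear
    * the value label column shows origin, not foreign
    describe foreign
    label list origin
    * fails: no value label is named foreign
    capture noisily label drop foreign
    * works, but removes the whole set of labels named origin
    label drop origin
    ```

    Note that there is no way to add an if qualifier here: label drop takes label names (or _all) and nothing else.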

    This is dubious:

    1. Stata doesn't use a lot of memory storing value labels in most cases. Much of the point of value labels is that a value label need only be stored once.

    2. This kind of question seems to imply that value labels were set up before you came along and that each value might find an observation to stick to. That was very possibly wise thinking.

    This is dangerous:

    1. The same value labels may be used for more than one variable, so in principle you need to check for use on all the variables that use a particular set.

    2. You need to worry about what might happen if you append or merge with similar datasets. That could lead to more mess than you want.

    3. Less biting, but also worth mentioning, is that a value label that isn't in the data might still be useful for graphical purposes.
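
    The first danger can be checked directly. As a minimal sketch, again on the auto data, ds with has(vallabel ...) lists every variable attached to a given set of value labels; foreign2 is a made-up variable created only so that two variables share the set.

    ```stata
    sysuse auto, clear
    * hypothetical second variable sharing the same value labels
    generate byte foreign2 = foreign
    label values foreign2 origin
    * list all variables that use the value label origin
    ds, has(vallabel origin)
    display "`r(varlist)'"
    ```

    Any pruning of origin would need to respect the values present in every variable listed, not just one of them.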

    So, I don't advise what you're thinking of. You could try a decode of each variable with value labels and then an encode based on those values. But the value labels wouldn't necessarily be in a desired order. By default encode would use alphabetical order and you end up with nonsense like 1 "Acceptable" 2 "Bad" 3 "Good" or 1 "Agree" 2 "Disagree" 3 "Neutral". It's possible to imagine ending up with more labels than you started with.
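
    Here is a toy example of that reordering, with made-up data and a made-up label name ratinglbl. After the round trip through decode and encode, the integer codes no longer follow the original order.

    ```stata
    clear
    input byte rating
    1
    2
    3
    end
    label define ratinglbl 1 "Good" 2 "Acceptable" 3 "Bad"
    label values rating ratinglbl
    decode rating, generate(rating_str)
    encode rating_str, generate(rating2)
    * encode stores its labels under the new variable's name by default
    label list rating2
    * the codes are now 1 Acceptable, 2 Bad, 3 Good
    list rating rating2, nolabel
    ```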

    There are other ways to do it properly, but it's a small project.

    Executive summary: Sorry, but that doesn't sound like a good idea.

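    EDIT: This is hacked out of dataex. The program displays the label values and label define commands for just those value labels that occur in the data. As written it declares version 15; it should also work in various earlier versions if the version statement is lowered accordingly.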

    *! 1.0.0 NJC 11apr2018
    program showvaluelabelsused
        version 15
        syntax [varlist]

        // find all variables with value labels attached
        quietly ds, has(vallabel)

        // collect the distinct value label names in use
        foreach v in `r(varlist)'  {
            local l : value label `v'
            local vlabels : list vlabels | l
        }

        foreach vl in `vlabels' {
            // pool the levels of every variable using this value label
            local alllevels
            qui ds , has(vallabel `vl')
            local vlist `r(varlist)'
            foreach v in `vlist' {
                qui levelsof `v', local(levels) missing
                local alllevels : list alllevels | levels
                dis as res "label values `v' `vl'"
            }

            // show a definition only for levels that occur and are labelled
            foreach n in `alllevels' {
                local ltext : label `vl' `n', strict
                if `"`ltext'"' != "" {
                    // use compound double quotes if the label text contains "
                    if strpos(`"`ltext'"',char(34)) dis as res `"label def `vl' `n' `"`ltext'"', modify"'
                    else dis as res `"label def `vl' `n' "`ltext'", modify"'
                }
            }
        }
    end
    
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . showvaluelabelsused
    label values foreign origin
    label def origin 0 "Domestic", modify
    label def origin 1 "Foreign", modify
    
    . keep if foreign
    (52 observations deleted)
    
    . showvaluelabelsused
    label values foreign origin
    label def origin 1 "Foreign", modify
    
    . webuse nlswork, clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . showvaluelabelsused
    label values race racelbl
    label def racelbl 1 "white", modify
    label def racelbl 2 "black", modify
    label def racelbl 3 "other", modify
    
    . keep if race == 2
    (20,483 observations deleted)
    
    . showvaluelabelsused
    label values race racelbl
    label def racelbl 2 "black", modify
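
    The program only displays commands; it changes nothing. If you still want to prune, one way to use the output is sketched below for the auto example: drop the label set, then run the displayed lines. This is only an illustration for a single label set used by a single variable, not a general tool.

    ```stata
    sysuse auto, clear
    keep if foreign
    * remove the full set of labels, then rebuild it from the displayed commands
    label drop origin
    label def origin 1 "Foreign", modify
    label values foreign origin
    label list origin
    ```

    Keeping the displayed commands in a do-file also gives you a record of what was removed, which helps if you later append or merge with data in which the other values do occur.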