Encode a string variable in non-alphanumeric order


I want to encode a string variable so that the assigned numeric codes follow the order in which the string values appear in the data (as shown when using browse), not alphanumeric order. Why? I need the encoded values and their labels in that order to get the correct variable names when using reshape wide.

Suppose var is a string variable with no labels:

var      label(var)  
"zoo"    none  
"abc"    none

If you start with:

encode var, gen(var2)

the labels are 1="abc" 2="zoo" as can be seen with

label li

But I want the labels numbered in the order in which the values occur in the data, as shown in browse, so that the order of the variables stays unchanged later.

I didn't find an encode option in which the labels are added in the order I see when using browse.

My best idea is to do it by hand:

ssc install labutil
labvalch var, f(1 2) t(2 1)

This works, but I have more than 50 distinct values to reorder.

Another approach would be to make reshape itself use a different order, but I don't think that is possible:

reshape wide x, i(id) j(var)

I only found

ssc install labutil
labmask regioncode, values(region)

as an alternative to encode, but I can't get labmask to work with my string variable.


Solution

  • First off, it's a rule in Stata that string variables can't have value labels. Only numeric variables can have value labels. In essence, what you want as value labels are already in your string variable as string values. So, the nub of the problem is that you need to create a numeric variable with values in the right order.

    Let's solve the problem in its easiest form: string values occur once and once only. So

    gen long order = _n 
    labmask order, values(var) 
    

    then solves the problem, as the numeric values 1, 2, ... are linked with the string values zoo, abc, whatever, which become value labels. Incidentally, a better reference for labmask, one of mine, is http://www.stata-journal.com/sjpdf.html?articlenum=gr0034
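
    The simple form relies on no string value being repeated. A minimal sketch of checking that assumption first with the official command isid, on a made-up two-observation dataset (the data are illustrative only, and labmask must already be installed with ssc install labutil):

    ```stata
    * Toy data, invented for illustration
    clear
    input str3 var
    "zoo"
    "abc"
    end

    * isid exits with an error if var does not uniquely identify observations
    isid var

    * Number observations in data order, then attach var's strings as value labels
    gen long order = _n
    labmask order, values(var)

    * Show the resulting mapping of codes to labels
    label list order
    ```

    If isid fails, the values are repeated and the more general first-occurrence approach is the one to use.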

    Now let's make it more complicated. String values might occur once or more times, but we want the numeric variable to respect first occurrence in the data.

    gen long order1 = _n
    egen order2 = min(order1), by(var) 
    egen order = group(order2) 
    labmask order, values(var) 
    

    Here's how that works.

    gen long order1 = _n 
    

    puts the observation numbers 1, 2, whatever in a new variable.

    egen order2 = min(order1), by(var)
    

    finds the first occurrence of each distinct value of var.

    egen order = group(order2) 
    

    maps those numbers to 1, 2, whatever.

    labmask order, values(var)
    

    links the numeric values of order and the string values of var, which become its value labels.

    Here is an example of how that works in practice.

    . l, sep(0)
    
        +---------------------------------+
        |   var   order1   order2   order |
        |---------------------------------|
     1. |   zoo        1        1     zoo |
     2. |   abc        2        2     abc |
     3. |   zoo        3        1     zoo |
     4. |   abc        4        2     abc |
     5. |   new        5        5     new |
     6. | newer        6        6   newer |
        +---------------------------------+
    
    . l, nola sep(0)
    
        +---------------------------------+
        |   var   order1   order2   order |
        |---------------------------------|
     1. |   zoo        1        1       1 |
     2. |   abc        2        2       2 |
     3. |   zoo        3        1       1 |
     4. |   abc        4        2       2 |
     5. |   new        5        5       3 |
     6. | newer        6        6       4 |
        +---------------------------------+
    

    You would drop order1 order2 once you have got the right answer.
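
    For anyone who wants to reproduce the listing, here is a self-contained sketch that enters the same six values and runs the four commands. It assumes labmask is installed (ssc install labutil); the str5 storage type is simply my choice for these data.

    ```stata
    * Enter the example data in the order shown in the listing
    clear
    input str5 var
    "zoo"
    "abc"
    "zoo"
    "abc"
    "new"
    "newer"
    end

    * Observation number of each row
    gen long order1 = _n
    * First occurrence of each distinct value of var
    egen order2 = min(order1), by(var)
    * Map first occurrences to consecutive integers 1, 2, ...
    egen order = group(order2)
    * Use the strings in var as value labels of order
    labmask order, values(var)

    * Inspect the mapping, then remove the helper variables
    label list order
    drop order1 order2
    ```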

    See also sencode for another solution. (search sencode to find references and download locations.)
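
    A hedged sketch of the sencode route follows. It assumes that sencode can be installed from SSC and that, as I understand its help file, it numbers string values by order of first appearance by default and takes a generate() option. Please check help sencode after installing rather than taking this on trust.

    ```stata
    * Assumption: sencode is available from SSC
    ssc install sencode

    * Toy data, invented for illustration
    clear
    input str5 var
    "zoo"
    "abc"
    "zoo"
    "new"
    end

    * Assumption: codes follow order of first appearance unless gsort() is given
    sencode var, generate(order)

    * Inspect codes with and without their value labels
    list var order
    list var order, nolabel
    ```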