Search code examples
labelmatchstatalookup-tables

How to keep observations match after value label modified?


I'm doing some basic data processing stuff with Stata. However, I get stuck on how to keep the observations of the other variables matching, after modifying a variable's value label.

For example, say, I have the raw dataset as follows:

   var1   var2   var3
    US    1000    15
  China    500    20
    UK     800    10

The var1 is a string variable. I need to convert it into a numeric one. So I typed

   encode var1, gen(country)    // where the new variable country is numeric

The variables that I want for future work are var2, var3 and country.

When the new variable is generated, it automatically has a value label. To be specific, if I click on US in the column of country, it shows 3 in the cell on top of the column.

   var1   var2   var3  country
    US    1000    15     US
  China    500    20   China
    UK     800    10     UK

So far it's good. I mean for each row the observations match.

However, this automatically generated value label is not exactly what I want. What I need is to make 1 denote UK, 2 for US and 3 for China. So I modified it by:

     label define country 1 "UK" 2 "US" 3 "China", modify

As shown below, I get the value labels that I want. Like, a click on China gives me 3.

     var1   var2   var3  country
      US    1000    15    China
    China    500    20     UK
      UK     800    10     US

But the point is, as you may already notice, the order of the observations for country is arranged alphabetically. It makes the corresponding values of var2 and var3 for each country to be wrong.

That is exactly where I could not get through. Since the columns that I finally need are var2, var3 and country, I'd like to have each country with its corresponding values of var2, var3 and as well the customized value labels.

I tried sort, order and a couple of commands but none gave me the result. I don't know what command I should use or there's something wrong in my earlier work.


Solution

  • Using the label() option of encode, is one way:

    clear
    set more off
    
    input ///
    str10 var1   var2   var3
      US      1000    15
      China    500    20
      UK       800    10
    end
    
    label define lblcountry 1 "UK" 2 "US" 3 "China"
    encode var1, generate(country) label(lblcountry)
    
    list
    list, nolabel
    

    See also the user-written command sencode (super-encode), which can do more: ssc describe sencode.