Search code examples
stata

Stata alternatives for lookup


I have a large Stata dataset that contains the following variables: year, state, household_id, individual_id, partner_id, and race. Here is an example of my data:

 year  state  household_id  individual_id  partner_id  race
 1980  CA     23            2              1           3
 1980  CA     23            1              2           1
 1990  NY     43            4              2           1
 1990  NY     43            2              4           1

Note that, in the above table, column 1 and 2 are married to each other. I want to create a variable that is one if the person is in an interracial marriage.

As a first step, I used the following code

by household_id year: gen inter=0 if race==race[partner_id]
replace inter=1 if inter==.

This code worked well but gave the wrong result in a few cases. As an alternative, I created a string variable identifying each user and its partner, using

 gen id_user=string(household_id)+"."+string(individual_id)+string(year)
 gen id_partner=string(household_id)+"."+string(partner_id)+string(year)

What I want to do now is to create something like what vlookup does in Excel: for each column, save locally the id_partner, find it in the id_user and find their race, and compare it with the race of the original user.

I guess it should be something like this?

gen inter2==1 if (find race[idpartner]) == (race[iduser])

The expected output should be like this

 year  state  household_id  individual_id  partner_id  race  inter2
 1980  CA     23            2              1           3     1 
 1980  CA     23            1              2           1     1
 1990  NY     43            4              2           1     0
 1990  NY     43            2              4           1     0

Solution

  • I don't think you need anything so general. As you realise, the information on identifiers suffices to find couples, and that in turn allows comparison of race for the people in each couple.

    In the code below _N == 2 is meant to catch data errors, such as one partner but not the other being an observation in the dataset or repetitions of one partner or both.

    clear 
    
    input year  str2 state  household_id  individual_id  partner_id  race
     1980  CA     23            2              1           3
     1980  CA     23            1              2           1
     1990  NY     43            4              2           1
     1990  NY     43            2              4           1 
     end 
    
    generate couple_id = cond(individual_id < partner_id, string(individual_id) + ///
                              " " + string(partner_id), string(partner_id) + ///
                              " " + string(individual_id)) 
    
    bysort state year household_id couple_id : generate mixed = race[1] != race[2] if _N == 2 
    
    list, sepby(household_id) abbreviate(15)
    
         +-------------------------------------------------------------------------------------+
         | year   state   household_id   individual_id   partner_id   race   couple_id   mixed |
         |-------------------------------------------------------------------------------------|
      1. | 1980      CA             23               2            1      3         1 2       1 |
      2. | 1980      CA             23               1            2      1         1 2       1 |
         |-------------------------------------------------------------------------------------|
      3. | 1990      NY             43               4            2      1         2 4       0 |
      4. | 1990      NY             43               2            4      1         2 4       0 |
         +-------------------------------------------------------------------------------------+
    

    This idea is documented in this article. The link gives free access to a pdf file.