I have a large Stata dataset that contains the following variables: year
, state
, household_id
, individual_id
, partner_id
, and race
. Here is an example of my data:
year state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
Note that, in the above table, column 1 and 2 are married to each other. I want to create a variable that is one if the person is in an interracial marriage.
As a first step, I used the following code
by household_id year: gen inter=0 if race==race[partner_id]
replace inter=1 if inter==.
This code worked well but gave the wrong result in a few cases. As an alternative, I created a string variable identifying each user and its partner, using
gen id_user=string(household_id)+"."+string(individual_id)+string(year)
gen id_partner=string(household_id)+"."+string(partner_id)+string(year)
What I want to do now is to create something like what vlookup
does in Excel: for each column, save locally the id_partner, find it in the id_user and find their race, and compare it with the race of the original user.
I guess it should be something like this?
gen inter2==1 if (find race[idpartner]) == (race[iduser])
The expected output should be like this
year state household_id individual_id partner_id race inter2
1980 CA 23 2 1 3 1
1980 CA 23 1 2 1 1
1990 NY 43 4 2 1 0
1990 NY 43 2 4 1 0
I don't think you need anything so general. As you realise, the information on identifiers suffices to find couples, and that in turn allows comparison of race
for the people in each couple.
In the code below _N == 2
is meant to catch data errors, such as one partner but not the other being an observation in the dataset or repetitions of one partner or both.
clear
input year str2 state household_id individual_id partner_id race
1980 CA 23 2 1 3
1980 CA 23 1 2 1
1990 NY 43 4 2 1
1990 NY 43 2 4 1
end
generate couple_id = cond(individual_id < partner_id, string(individual_id) + ///
" " + string(partner_id), string(partner_id) + ///
" " + string(individual_id))
bysort state year household_id couple_id : generate mixed = race[1] != race[2] if _N == 2
list, sepby(household_id) abbreviate(15)
+-------------------------------------------------------------------------------------+
| year state household_id individual_id partner_id race couple_id mixed |
|-------------------------------------------------------------------------------------|
1. | 1980 CA 23 2 1 3 1 2 1 |
2. | 1980 CA 23 1 2 1 1 2 1 |
|-------------------------------------------------------------------------------------|
3. | 1990 NY 43 4 2 1 2 4 0 |
4. | 1990 NY 43 2 4 1 2 4 0 |
+-------------------------------------------------------------------------------------+
This idea is documented in this article. The link gives free access to a pdf
file.