Search code examples
stata

How to summarise useful information from existing dataset and combine in a new one?


I am trying to summarise useful information from a survey dataset. This dataset contains information on the surveyed individuals' parents. One ID will associate four rows, containing information on their mother, father, mother-in-law and father-in-law. However, I am only interested in the surveyed person, rather than their parents.

* Example generated by -dataex-. To install: ssc install dataex
clear
input str12 ID byte(parentID ca001)
"010104101002" 1 2
"010104101002" 2 1
"010104101002" 3 1
"010104101002" 4 1
"010104102002" 1 2
"010104102002" 2 2
"010104102002" 3 2
"010104102002" 4 1
"010104103001" 1 2
"010104103001" 2 2
"010104103001" 3 2
"010104103001" 4 1
"010104104001" 1 2
"010104104001" 2 2
"010104104001" 3 2
"010104104001" 4 1
"010104105002" 1 2
"010104105002" 2 2
"010104105002" 3 2
"010104105002" 4 2
end
label values parentID parent
label def parent 1 "1 Father", modify
label def parent 2 "2 Mother", modify
label def parent 3 "3 Father-in-law", modify
label def parent 4 "4 Mother-in-law", modify
label values ca001 ca001
label def ca001 1 "1 Yes", modify
label def ca001 2 "2 No", modify

For example, ca001 represents whether the respondents' parents (mother/father/mother-in-law/father-in-law) are still alive. What I need is a dummy variable, indicating the number of the ID's parents that are still alive (0-4).

I need to get rid of repeated IDs and have one unique ID for one observation. This is because I need to merge this dataset with other datasets by matching the unique ID from one dataset to another.


Solution

  • This might work for you:

    bysort ID: egen alive_parents = total(-(ca001-2))
    keep ID alive_parents
    duplicates drop
    list
    
         +-------------------------+
         |     ID    alive_parents |
         |-------------------------|
      1. | 010104101002          3 |
      2. | 010104102002          1 |
      3. | 010104103001          1 |
      4. | 010104104001          1 |
      5. | 010104105002          0 |
         +-------------------------+
    

    The idea is to subtract 2 from ca001 so that 0 == No and -1 == Yes and then take the negative of that so 0 == No and 1 == Yes, then sum by ID to get the total number of alive parents.

    Then we drop the extra variables and are left with ID-alive_parents pairs that have 4 duplicates each, so we drop the duplicates.