
Construct new variable from >3 categorical variables (+maintain column names) for mosaic plot in Stata


My question is an extension of that found here: Construct new variable from given 5 categorical variables in Stata

I am an R user and have been struggling to adjust to Stata syntax. I'm also used to being able to Google for R documentation and examples online, and I haven't found as many resources for Stata, so I've come here.

I have a data set where the rows represent individual people and the columns record various attributes of these people. There are 5 categorical variables (white, hispanic, black, asian, other) that hold binary response data, 0 or 1 ("No" or "Yes"). I want to create a mosaic plot of race vs response data using the spineplots package. However, I believe I must first combine all 5 of the categorical variables into a single categorical variable with 5 levels that keeps the labels, so that I can see the response rate for each ethnicity. I've been playing around with the egen function but haven't been able to get it to work. Any help would be appreciated.

Edit: Added a depiction of what my data looks like and what I want it to look like.

My data right now:

    person_id,black,asian,white,hispanic,responded
    1,0,0,1,0,0
    2,1,0,0,0,0
    3,1,0,0,0,1
    4,0,1,0,0,1
    5,0,1,0,0,1
    6,0,1,0,0,0
    7,0,0,1,0,1
    8,0,0,0,1,1

What I want is to produce a table like the following with the tabulate command:

    respond, black, asian, white, hispanic
    responded to survey |    20, 30, 25, 10, 15
    did not respond     |    15, 20, 21, 23, 33

Solution

  • It seems like you want a single categorical variable rather than multiple {0,1} dummies. The easiest way is probably a loop over the dummies that fills in a string variable. Another option is to use cond() to generate a numeric variable (note that you may want to catch respondents for whom all the race dummies are 0 in an 'other' group), attach value labels to it (and to responded), and then create your frequency table:

    clear
    input person_id black asian white hispanic responded
    1 0 0 1 0 0
    2 1 0 0 0 0
    3 1 0 0 0 1
    4 0 1 0 0 1
    5 0 1 0 0 1
    6 0 1 0 0 0
    7 0 0 1 0 1
    8 0 0 0 1 1
    9 0 0 0 0 1
    end
    
    * String version: start everyone as "other", then overwrite from each dummy
    gen race = "other"
    foreach v of varlist black asian white hispanic {
        replace race = "`v'" if `v' == 1
    }
    
    * Numeric version: the codes returned by cond() must match the label definition
    label define race2 1 "asian" 2 "black" 3 "hispanic" 4 "white" 99 "other"
    gen race2:race2 = cond(asian == 1, 1, ///
                    cond(black == 1, 2, ///
                    cond(hispanic == 1, 3, ///
                    cond(white == 1, 4, 99))))
    
    label define responded 0 "did not respond" 1 "responded to survey"
    label values responded responded
    tab responded race
    

    with the result

                        |                          race
              responded |     asian      black   hispanic      other      white |     Total
    --------------------+-------------------------------------------------------+----------
        did not respond |         1          1          0          0          1 |         3 
    responded to survey |         2          1          1          1          1 |         6 
    --------------------+-------------------------------------------------------+----------
                  Total |         3          2          1          1          2 |         9 
    

    tab responded race2 yields the same counts with a different column ordering: columns follow the numeric values of race2 (asian, black, hispanic, white, other) rather than the alphabetical order of the strings in race.
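
Both approaches above assume that each person has at most one race dummy set to 1. If someone has several, the loop keeps the last match and cond() keeps the first, without any warning. Below is a minimal sketch of how one might check that assumption, build the labelled variable with a loop, and get response rates as column percentages. The variable name nraces is made up for illustration. The commented-out spineplot lines assume the community-contributed spineplot command from SSC and its yvar-then-xvar argument order, so check help spineplot after installing.

```stata
clear
input person_id black asian white hispanic responded
1 0 0 1 0 0
2 1 0 0 0 0
3 1 0 0 0 1
4 0 1 0 0 1
5 0 1 0 0 1
6 0 1 0 0 0
7 0 0 1 0 1
8 0 0 0 1 1
9 0 0 0 0 1
end

* Count how many race dummies are set per person (nraces is a hypothetical name)
egen nraces = rowtotal(black asian white hispanic)
tab nraces
* Stop with an error if anyone has more than one dummy set
assert nraces <= 1

* Build a numeric race variable and its value label in the same loop,
* so the codes and the labels cannot drift apart
gen byte race2 = 99
local i = 0
foreach v in asian black hispanic white {
    local ++i
    replace race2 = `i' if `v' == 1
    label define race2 `i' "`v'", add
}
label define race2 99 "other", add
label values race2 race2

label define responded 0 "did not respond" 1 "responded to survey"
label values responded responded

* Frequencies plus column percentages (share responding within each race)
tab responded race2, column

* Optional mosaic-style plot; assumes spineplot has been installed from SSC
* ssc install spineplot
* spineplot responded race2
```

The column option prints the within-race percentage under each count, which is the response rate per group that the question asks about.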