Stata: Generating individual comparison groups for each observation in sample (age brackets)


I am trying to assign certain properties of a comparison group (e.g., mean income) to each individual in my microdata sample. The comparison groups are defined by other observables (gender, region) and are made up of the other individuals in the sample. So far, I have coded:

     egen com_group = group(gender region)
     bysort com_group: egen com_income = mean(income)

This works, but the approach raises two issues:

  1. Because the mean is calculated over all individuals in a group, and each observation belongs to its own group, its own income enters the mean income of its reference group. This may introduce a (small) bias. This problem seems minor compared to problem 2.

  2. I would prefer to assign the average income of less static groups. More concretely, I am thinking of comparison groups of the type group(gender region age +/- 5 years). Such moving age brackets cannot be handled in the way shown above, because each observation of a different age has a different age bracket, so the grouping cannot be stored in a single variable like com_group. My idea was to loop over all observations and generate observation-specific reference groups, but I don't really know how to do this.
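
Issue 1 on its own does not need a loop. A minimal sketch, assuming the static gender-by-region groups from the code above: subtract each observation's own income from the group total and divide by the group count minus one. The example data and the names tot_income, n_income and com_income_loo are made up for illustration.

```stata
clear
input gender region income
1 1 10
1 1 20
1 1 30
2 1 40
2 1 60
2 2 50
end

* group total and number of nonmissing incomes
bysort gender region: egen double tot_income = total(income)
by gender region: egen n_income = count(income)

* leave-one-out mean: drop own income from numerator and denominator;
* a group of size one divides by zero and yields missing
gen double com_income_loo = (tot_income - income) / (n_income - 1)

list, sepby(gender region)
```

This only covers fixed groups; the moving age brackets of issue 2 still need a per-observation comparison.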


Solution

  • Will this give you what you want? I have not checked the details. I will add some explanation later. In this example, the age range is +/- 1 and the grouping variable is race.

    clear all
    set more off
    
    *----- example data -----
    
    input ///
        idcode   age   race       wage      
            45    35      1   10.18518   
            47    35      1   3.526568     
            48    35      1   5.852843     
             1    37      2   11.73913     
             2    37      2   6.400963     
             9    37      1   10.49114     
            36    37      1   4.180602     
             7    39      1    4.62963     
            15    39      1   16.79548     
            20    39      1   9.661837     
            12    40      1   17.20612     
            13    40      1   13.08374     
            14    40      1   7.745568     
            16    40      1   15.48309     
            18    40      1   5.233495     
            19    40      1   10.16103     
            97    40      2   19.92563    
            22    41      1   9.057972     
            24    41      1   11.09501     
            44    41      1   28.45666   
            98    41      2   4.098635    
             3    42      2   5.016723     
             6    42      1   8.083731     
            23    42      1    8.05153     
            25    42      1   9.581316     
            99    42      2   9.875124    
             4    43      1   9.033813     
            39    44      1   9.790657     
            46    44      1   3.051529     
    end
    
    sort age idcode
    list, sepby(age)
    
    *----- what you want -----
    
    gen mwage = .
    levelsof race, local(lrace)
    
    forvalues i = 1/`=_N' {
        foreach j of local lrace {
    
            summarize wage if ///
                inrange(age, age[`i']-1, age[`i']+1) /// age condition
                & race == `j'                        /// race condition
                & _n != `i'                          /// self-exclude condition
                , meanonly
    
            replace mwage = r(mean) if race == `j' in `i'
    
        }
    }
    
    list, sepby(age) 
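
    The inner loop over the levels of race is not strictly needed: each observation can be compared with its own group value directly. A minimal sketch of that variant on made-up toy data; mwage2 is an illustrative name, and the age range is again +/- 1.

    ```stata
    clear
    input idcode age race wage
    1 35 1 10
    2 35 1 20
    3 36 1 30
    4 38 1 40
    5 36 2 50
    end

    gen mwage2 = .

    forvalues i = 1/`=_N' {
        * mean wage of the other observations in the same race
        * whose age lies within +/- 1 of observation i
        summarize wage if                          ///
            inrange(age, age[`i']-1, age[`i']+1)   ///
            & race == race[`i']                    ///
            & _n != `i'                            ///
            , meanonly

        * r(mean) is missing when nobody else qualifies
        replace mwage2 = r(mean) in `i'
    }

    list
    ```

    With these five observations, the first three get 25, 20 and 15, and the last two get missing because nobody else falls in their group and bracket.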
    

    Edit

    If the Stata loop is too slow for your dataset, you can do it in Mata. Here is my attempt (I am only starting to use Mata):

    clear all
    set more off
    
    *----- example data -----
    
    sysuse nlsw88
    expand 2
    
    *----- what you want -----
    
    egen gro = group(race industry) // grouping variables
    
    * Get number of groups
    summarize gro, meanonly
    local numgro = r(max)
    
    * Compute upper limits for groups
    forvalues i = 1/`numgro' {
         summarize gro if gro == `i', meanonly
         local countgro `countgro' `r(N)'
    }
    
    /*
    Sort by group and bracket variable. Sorting in Stata lets the Mata results
    be posted back to Stata using only -getmata-
    */
    
    sort gro age 
    
    * Take statistic and bracket variables to Mata
    putmata STVAR=wage BRVAR=age 
    
    mata:
    
    /*
    Get upper limits of groups from Stata.
    Not considered good style. See Mata Matters: Macros, Gould (2008)
    */
    
    UPLIM = tokens(st_local("countgro")) 
    UPLIM = runningsum(strtoreal(UPLIM)) // upper limits of groups
    
    /*
    For example, in the following observation ranges, each line 
    shows lower and upper limits:
    
    1-11 
    12-23 
    24-28 
    29-29 
    */
    
    
    ST = J(rows(STVAR), 1, .)
    for (i = 1; i <= cols(UPLIM); i++) {
    
        if (i == 1) {
            ro = 1
        }
        else {
            ro = UPLIM[i-1]+1
        }
    
        co = UPLIM[i]
    
        STVARP = STVAR[|ro\co|]     // statistic variable
        BRVARP = BRVAR[|ro\co|]     // bracket variable
    
        STPART = J(rows(STVARP), 1, 0)
        for (j = 1; j <= rows(BRVARP); j++) {
    
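            // indicator: at or above the lower bracket limit
            SMALLER = BRVARP :>= BRVARP[j] - 1
            // indicator: at or below the upper bracket limit
            LARGER = BRVARP :<= BRVARP[j] + 1
    
            // leave-one-out mean within the bracket; a bracket holding only the
            // observation itself (e.g. the last group) gives 0/0, i.e. missing (.)
            STPART[j] = ( sum(STVARP :* SMALLER :* LARGER) - STVARP[j] ) / ( sum(SMALLER :* LARGER) - 1 )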
    
        }
    
        ST[|ro\co|] = STPART // stack results
    }
    
    end
    
    getmata mwage=ST
    
    keep wage race industry gro age mwage
    sort gro age wage
    
    //list wage gro age mwage, sepby(gro)
    

    Mata is very fast with loops; a dataset with 15,000 observations takes only a few seconds.
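
    The inner step of the Mata loop can be checked by hand on a tiny vector. A minimal sketch for a single group, using made-up values; INBR is an illustrative name that combines the two indicator vectors from the code above into one.

    ```stata
    mata:
    // toy data for one group, sorted by age (values made up)
    STVARP = (10 \ 20 \ 30 \ 40)   // statistic variable, e.g. wage
    BRVARP = (35 \ 35 \ 36 \ 38)   // bracket variable, e.g. age

    STPART = J(rows(STVARP), 1, .)
    for (j = 1; j <= rows(BRVARP); j++) {
        // 1 if the row lies within +/- 1 of the age in row j, else 0
        INBR = (BRVARP :>= BRVARP[j] - 1) :* (BRVARP :<= BRVARP[j] + 1)
        // bracket sum and count, each net of row j itself
        STPART[j] = (sum(STVARP :* INBR) - STVARP[j]) / (sum(INBR) - 1)
    }
    STPART
    end
    ```

    The first three rows share a bracket and get 25, 20 and 15; the last row has no neighbours within +/- 1, so its 0/0 comes out as missing.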