Stata: Generating individual comparison groups for each observation in sample (age brackets)

I am trying to assign certain properties of a comparison group (e.g., mean income) to each individual in my microdata sample. The comparison groups are defined by other observables (gender, region) and are made up of the other individuals in the sample. So far, I have coded:

     egen com_group = group(gender region)
     bysort com_group: egen com_income = mean(income)

This works, but it raises two issues:

1. The mean is calculated over all individuals in a group, and the current observation is part of its own group, so its own income enters the mean income of its own reference group. This might introduce a (small) bias. This problem seems minor compared to problem 2.
2. I would prefer to assign the average income of less static groups. More concretely, I am thinking of comparison groups of the type group(gender region age+-5years). Such running age brackets cannot be handled in the way shown above, because each observation of a different age has a different age bracket, so this information cannot be stored in a single grouping variable like com_group above. My idea was to loop over all observations and generate observation-specific reference groups, but I don't really know how to do this.

Solution

Will this give you what you want? I have not checked the details. I will add some explanation later. In this example, the range for age is +/- 1 and the grouping variable is race:

clear all
set more off

*----- example data -----

input ///
    idcode   age   race       wage      
        45    35      1   10.18518   
        47    35      1   3.526568     
        48    35      1   5.852843     
         1    37      2   11.73913     
         2    37      2   6.400963     
         9    37      1   10.49114     
        36    37      1   4.180602     
         7    39      1    4.62963     
        15    39      1   16.79548     
        20    39      1   9.661837     
        12    40      1   17.20612     
        13    40      1   13.08374     
        14    40      1   7.745568     
        16    40      1   15.48309     
        18    40      1   5.233495     
        19    40      1   10.16103     
        97    40      2   19.92563    
        22    41      1   9.057972     
        24    41      1   11.09501     
        44    41      1   28.45666   
        98    41      2   4.098635    
         3    42      2   5.016723     
         6    42      1   8.083731     
        23    42      1    8.05153     
        25    42      1   9.581316     
        99    42      2   9.875124    
         4    43      1   9.033813     
        39    44      1   9.790657     
        46    44      1   3.051529     
end

sort age idcode
list, sepby(age)

*----- what you want -----

gen mwage = .
levelsof race, local(lrace)

forvalues i = 1/`=_N' {
    foreach j of local lrace {

        summarize wage if ///
            inrange(age, age[`i']-1, age[`i']+1) /// age condition
            & race == `j'                        /// race condition
            & _n != `i'                          /// self-exclude condition
            , meanonly

        replace mwage = r(mean) if race == `j' in `i'

    }
}

list, sepby(age)
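
For the first issue in the question alone (a fixed group whose mean should exclude the observation itself), no loop is needed: the leave-one-out mean is the group total minus the observation's own value, divided by the group count minus one. A minimal sketch, assuming the nlsw88 example dataset, with wage standing in for income and race and industry standing in for gender and region. The names grp, tot, n and loo_mean are made up for illustration:

```stata
sysuse nlsw88, clear

egen grp = group(race industry)             // static comparison groups
bysort grp: egen double tot = total(wage)   // group sum of wage
bysort grp: egen n = count(wage)            // nonmissing wages per group

* leave-one-out mean; singleton groups divide by zero and get missing (.)
generate double loo_mean = (tot - wage) / (n - 1) if !missing(wage, grp)
```

This only covers static groups. The moving age bracket still needs a per-observation window, as in the loop above.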

Edit

If the Stata loop is too slow for your dataset, you can do it in Mata. Here is my attempt (I'm only starting to use Mata):

clear all
set more off

*----- example data -----

sysuse nlsw88
expand 2

*----- what you want -----

egen gro = group(race industry) // grouping variables

* Get number of groups
summarize gro, meanonly
local numgro = r(max)

* Count observations in each group (cumulated into upper limits in Mata below)
forvalues i = 1/`numgro' {
     summarize gro if gro == `i', meanonly
     local countgro `countgro' `r(N)'
}

/*
Sort by group and bracketing variable. Sorting in Stata lets the Mata
results be posted back to Stata using only -getmata-
*/

sort gro age 

* Take statistic and bracketing variables to Mata
putmata STVAR=wage BRVAR=age 

mata:

/*
Get upper limits of groups from Stata.
Not considered good style. See Mata Matters: Macros, Gould (2008)
*/

UPLIM = tokens(st_local("countgro")) 
UPLIM = runningsum(strtoreal(UPLIM)) // upper limits of groups

/*
For example, in the following observation ranges, each line 
shows lower and upper limits:

1-11 
12-23 
24-28 
29-29 
*/


ST = J(rows(STVAR), 1, .)
for (i = 1; i <= cols(UPLIM); i++) {

    if (i == 1) {
        ro = 1
    }
    else {
        ro = UPLIM[i-1]+1
    }

    co = UPLIM[i]

    STVARP = STVAR[|ro\co|]     // statistic variable
    BRVARP = BRVAR[|ro\co|]     // bracket variable

    STPART = J(rows(STVARP), 1, 0)
    for (j = 1; j <= rows(BRVARP); j++) {

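        SMALLER = BRVARP :>= BRVARP[j] - 1   // 1 if at or above the lower bound
        LARGER = BRVARP :<= BRVARP[j] + 1    // 1 if at or below the upper bound

        // leave-one-out mean within the +/- 1 window;
        // 0/0 gives missing (.) when no other observation falls in the window,
        // e.g. for the last group with only one observation
        STPART[j] = ( sum(STVARP :* SMALLER :* LARGER) - STVARP[j] ) / ( sum(SMALLER :* LARGER) - 1 )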

    }

    ST[|ro\co|] = STPART // stack results
}

end

getmata mwage=ST

keep wage race industry gro age mwage
sort gro age wage

//list wage gro age mwage, sepby(gro)
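
The expression inside the inner loop is a leave-one-out mean over the age window: the sum of the statistic over the observations in the window, minus the observation's own value, divided by the window count minus one. A toy check with made-up numbers (the vectors age and wage here are hypothetical, not taken from nlsw88, and inwin is an illustrative name):

```stata
mata:
age  = (35 \ 35 \ 36 \ 38)      // bracket variable, one group, sorted
wage = (10 \ 4 \ 6 \ 8)         // statistic variable
j = 1                           // observation whose comparison mean we want

// 1 if age is within +/- 1 of age[j], 0 otherwise
inwin = (age :>= age[j] - 1) :& (age :<= age[j] + 1)

// (10 + 4 + 6 - 10) / (3 - 1) = 5: the mean of the other two in-window wages
(sum(wage :* inwin) - wage[j]) / (sum(inwin) - 1)
end
```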

Mata is fast with loops; a dataset with 15,000 observations takes only a few seconds.
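
As an alternative to hand-written loops, the community-contributed rangestat command (from SSC, not part of official Stata) appears to cover this case directly. A sketch, assuming rangestat is installed and that its interval(), by() and excludeself options behave as described in its help file. Check help rangestat before relying on it:

```stata
* ssc install rangestat      // one-time install, community-contributed
sysuse nlsw88, clear

* mean wage of others with the same race and industry, aged within +/- 1 year
rangestat (mean) wage, interval(age -1 1) by(race industry) excludeself

* assumption: with default naming the result is stored in wage_mean
summarize wage_mean
```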