I am trying to assign certain properties of a comparison group (e.g., mean income) to each individual in my microdata sample. The comparison groups are defined by other observables (gender, region), and their properties are computed from the other individuals in the group. So far, I have coded:
egen com_group = group(gender region)
bysort com_group: egen com_income = mean(income)
This works, but it raises two issues:
1. The mean is calculated over all individuals in a group, and the current observation belongs to its own group, so its own income enters the mean income of its own reference group. This might introduce a (small) bias. This problem seems minor compared to problem 2.
2. I would prefer to assign the average income of less static groups. More concretely, I'm thinking about comparison groups of the type group(gender region age±5 years). Such running age brackets can't be handled in the way shown above, because observations of different ages have different age brackets, so the information can't be stored in a single variable like com_group. My idea was to loop over all observations and generate observation-specific reference groups, but I don't really know how to do this.
Will this give you what you want? I have not checked the details, and I will add some explanation later. In this example, the range for age is +/- 1 and the grouping variable is race.
clear all
set more off
*----- example data -----
input ///
idcode age race wage
45 35 1 10.18518
47 35 1 3.526568
48 35 1 5.852843
1 37 2 11.73913
2 37 2 6.400963
9 37 1 10.49114
36 37 1 4.180602
7 39 1 4.62963
15 39 1 16.79548
20 39 1 9.661837
12 40 1 17.20612
13 40 1 13.08374
14 40 1 7.745568
16 40 1 15.48309
18 40 1 5.233495
19 40 1 10.16103
97 40 2 19.92563
22 41 1 9.057972
24 41 1 11.09501
44 41 1 28.45666
98 41 2 4.098635
3 42 2 5.016723
6 42 1 8.083731
23 42 1 8.05153
25 42 1 9.581316
99 42 2 9.875124
4 43 1 9.033813
39 44 1 9.790657
46 44 1 3.051529
end
sort age idcode
list, sepby(age)
*----- what you want -----
gen mwage = .
levelsof race, local(lrace)
forvalues i = 1/`=_N' {
    foreach j of local lrace {
        summarize wage if ///
            inrange(age, age[`i']-1, age[`i']+1) /// age condition
            & race == `j' /// race condition
            & _n != `i' /// self-exclude condition
            , meanonly
        replace mwage = r(mean) if race == `j' in `i'
    }
}
list, sepby(age)
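The inner loop over the race levels is not strictly required: comparing against the focal observation's own value, race[`i'], should select the same reference group. Below is a minimal sketch of that variant on a small made-up dataset (the values are illustrative, not taken from the example above). Observations with no other group member inside the age window end up with a missing mwage.

```stata
clear
* toy data, made up for illustration
input id age race wage
1 35 1 10
2 35 1 4
3 36 1 6
4 37 1 8
5 36 2 20
6 38 2 30
end

gen double mwage = .
forvalues i = 1/`=_N' {
    * mean wage of others with the same race and age within +/- 1
    summarize wage if inrange(age, age[`i'] - 1, age[`i'] + 1) ///
        & race == race[`i'] & _n != `i', meanonly
    replace mwage = r(mean) in `i'
}
list
```

For observation 1 (age 35, race 1), the reference group is observations 2 and 3, so mwage is (4 + 6) / 2 = 5. Observations 5 and 6 have no same-race neighbor within one year, so their mwage stays missing.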
If the Stata loop is too slow for your dataset, you can do it in Mata. Here is my attempt (I'm only starting to use it):
clear all
set more off
*----- example data -----
sysuse nlsw88
expand 2
*----- what you want -----
egen gro = group(race industry) // grouping variables
* Get number of groups
summarize gro, meanonly
local numgro = r(max)
* Compute upper limits for groups
forvalues i = 1/`numgro' {
    summarize gro if gro == `i', meanonly
    local countgro `countgro' `r(N)'
}
/*
Sort by group and bracketing variable. Sorting in Stata lets the Mata
results be posted back to Stata using only -getmata-
*/
sort gro age
* Take statistic and bracketing variables to Mata
putmata STVAR=wage BRVAR=age
mata:
/*
Get upper limits of groups from Stata.
Not considered good style. See Mata Matters: Macros, Gould (2008)
*/
UPLIM = tokens(st_local("countgro"))
UPLIM = runningsum(strtoreal(UPLIM)) // upper limits of groups
/*
For example, in the following observation ranges, each line
shows lower and upper limits:
1-11
12-23
24-28
29-29
*/
ST = J(rows(STVAR), 1, .)
for (i = 1; i <= cols(UPLIM); i++) {
    if (i == 1) {
        ro = 1
    }
    else {
        ro = UPLIM[i-1]+1
    }
    co = UPLIM[i]
    STVARP = STVAR[|ro\co|] // statistic variable
    BRVARP = BRVAR[|ro\co|] // bracket variable
    STPART = J(rows(STVARP), 1, 0)
    for (j = 1; j <= rows(BRVARP); j++) {
        SMALLER = BRVARP :>= BRVARP[j] - 1
        LARGER = BRVARP :<= BRVARP[j] + 1
        // mean of the statistic within the bracket, excluding observation j;
        // 0/0 gives . when no other observation falls inside the bracket
        STPART[j] = ( sum(STVARP :* SMALLER :* LARGER) - STVARP[j] ) / ( sum(SMALLER :* LARGER) - 1 )
    }
    ST[|ro\co|] = STPART // stack results
}
end
getmata mwage=ST
keep wage race industry gro age mwage
sort gro age wage
// list wage gro age mwage, sepby(gro)
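The core of the inner Mata loop is an indicator-vector trick: two 0/1 vectors mark which observations lie inside the age window, their elementwise product selects the window, and the focal observation is then removed from both the sum and the count. A minimal sketch on made-up vectors for a single group (the names w, a, above, below, inwin and loo are only illustrative):

```stata
mata:
// toy data (made up): wage and age for one group
w = (10 \ 4 \ 6 \ 8)
a = (35 \ 35 \ 36 \ 37)
j = 3                          // focal observation (age 36)
above = a :>= a[j] - 1         // 1 if age is at or above the lower bound
below = a :<= a[j] + 1         // 1 if age is at or below the upper bound
inwin = above :* below         // 1 if age is inside the window
// window sum and count, each net of the focal observation
loo = (sum(w :* inwin) - w[j]) / (sum(inwin) - 1)
loo
end
```

Here all four ages fall within 36 +/- 1, so the result is (10 + 4 + 8) / 3, the mean wage of the other three observations.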
Mata handles loops well; a dataset with 15,000 observations takes only a few seconds.
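For the first issue in the question alone (static groups, own observation excluded), no loop should be needed: subtract the observation's own value from the group total and divide by the group count minus one. A minimal sketch using the question's variable names on made-up data (grp_total, grp_n and com_income_loo are hypothetical names introduced here):

```stata
clear
* toy data, made up for illustration
input gender region income
1 1 10
1 1 20
1 1 30
2 1 40
2 1 60
2 2 50
end

egen com_group = group(gender region)
bysort com_group: egen double grp_total = total(income)
bysort com_group: egen grp_n = count(income)

* leave-one-out mean; a group of size one divides by zero and yields missing
gen double com_income_loo = (grp_total - income) / (grp_n - 1) if !missing(income)
list
```

In the first group the total is 60 over three members, so the individual with income 10 gets (60 - 10) / 2 = 25. The single member of the last group has no comparison observations and gets a missing value.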