I have a data set with 3 million observations. I need to estimate a linear probability model (LPM) with SUR and get the marginal effects.
I used gsem ... vce(cluster x), then margins, ... force. But it takes a very long time to get the margins result (more than 2 hours). I do need standard errors for the confidence intervals, so I can't use the nose option.
Are there other ways I can improve the speed?
The exact code depends on which marginal effects you mean. You can calculate partial effects with lincom, which will most likely be faster than margins.
As an example, suppose we estimate this model:
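Written out to match the regression command used below (reg y c.x1##c.(x2 x3)), and with the β subscripts chosen here purely for illustration, the model is:

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \varepsilon
$$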
The partial effect of x1 on y can be obtained by taking the partial derivative with respect to x1:
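Writing β1 for the coefficient on x1, and β4 and β5 for the coefficients on the x1#x2 and x1#x3 interactions (labels assumed here for illustration), the derivative is:

$$
\frac{\partial y}{\partial x_1} = \beta_1 + \beta_4 x_2 + \beta_5 x_3
$$

Each term corresponds to one coefficient in the lincom expression: x1, c.x1#c.x2 and c.x1#c.x3.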
We can get the effect of x1 on y at the means of x2 and x3 by plugging in the means. To do this in Stata:
// Get data
webuse regress
// Run the regression
qui reg y c.x1##c.(x2 x3)
// Get the sample means of x2 and x3
sum x2 if e(sample), meanonly
scalar m_x2 = r(mean)
sum x3 if e(sample), meanonly
scalar m_x3 = r(mean)
// Calculate partial effect
lincom x1 + m_x2 * c.x1#c.x2 + m_x3*c.x1#c.x3
Result:
. lincom x1 + m_x2 * c.x1#c.x2 + m_x3*c.x1#c.x3
( 1) x1 - .2972973*c.x1#c.x2 + 3019.459*c.x1#c.x3 = 0
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   1.409372   1.005254     1.40   0.163    -.5778255    3.396569
------------------------------------------------------------------------------
As you can see, this matches the result from margins (up to rounding in the last digit of the confidence interval):
. qui reg y c.x1##c.(x2 x3)
. margins, dydx(x1) atmeans
Conditional marginal effects                    Number of obs     =        148
Model VCE    : OLS

Expression   : Linear prediction, predict()
dy/dx w.r.t. : x1
at           : x1              =    3.014865 (mean)
               x2              =   -.2972973 (mean)
               x3              =    3019.459 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   1.409372   1.005254     1.40   0.163    -.5778256    3.396569
------------------------------------------------------------------------------
Here's a speed comparison showing that lincom is about 14 times faster than margins in this case, with 3 million observations:
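clear
webuse regress
// Expand the 148 observations to roughly 3 million
expand 20271
// Variables to hold the timings from each run
gen lincom = .
gen margins = .
qui reg y c.x1##c.(x2 x3)
forval i = 1/50 {
    timer clear
    // Time the lincom approach, including computing the means
    timer on 1
    sum x2 if e(sample), meanonly
    scalar m_x2 = r(mean)
    sum x3 if e(sample), meanonly
    scalar m_x3 = r(mean)
    lincom x1 + m_x2 * c.x1#c.x2 + m_x3*c.x1#c.x3
    timer off 1
    // Time margins
    timer on 2
    margins, dydx(x1) atmeans
    timer off 2
    // timer list stores the elapsed times in r(t1) and r(t2)
    timer list
    replace lincom = r(t1) in `i'
    replace margins = r(t2) in `i'
}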
ttest lincom == margins
di "On average, lincom is " %4.2f `=r(mu_2) / r(mu_1)' " times faster than margins with `=_N' observations"
// On average, lincom is 13.88 times faster than margins with 3000108 observations