Search code examples
pythonregressionstatadummy-variable

Python: Regression slow compared with Stata (fixed-effect dummies)


I am trying to run a regression in Python, but it just takes ages and stops running. In Stata it works and only takes a few seconds.

This is due to a categorical column, including group fixed-effects. Without the variable, the performance of Stata and Python is rather equal, taking around 1 second for 200,000 observations:

Code Stata

reg income height Number_children

Code Python

model = smf.ols(income ~ height + Number_children, data=humans).fit() 

Adding the dummy, I change the Stata code to areg:

areg income height Number_children, absorb(Village)

It takes only 1-2 seconds longer than without the dummy.

In Python:

model = smf.ols(income ~ height + Number_children + Village, data=humans).fit()

where:

Name: Village, dtype: category
Categories (3678, object):

I stop the regression after 2 minutes of waiting. Is there any idea how to get the code running, and increase the speed to nearly as fast as Stata? Is the problem more caused by the variable or by the regression command?

  • EDIT:

Based on Dimitriy´s response, I tried this for all variables:

e.g.:

humans["income_gr_m"]= humans["income"].groupby(humans['Village']).mean()
humans["income_star"] = humans["income"] - humans["income_gr_m"] + humans["income"].mean()

However, this also makes Python work like at least 2 minutes (I stopped again). Or should the transformation be performed differently? Thx


Solution

  • areg is not actually inverting that matrix with 3,677 village indicators like you are doing with Python. It is transforming the data in a way to obviates the need to do that, so it will be much faster. This is also the reason why the constant from regress with village dummies will not match the constant from areg, though the slope coefficients should be the same, if you wait for Python to finish.

    Here is what areg does to calculate the coefficients with regress. The standard errors will too large since I am not doing the degree of freedom adjustment for the 5 absorbed effects, but I will do the adjustment manually below in a loop by multiplying the SEs:

    . sysuse auto, clear
    (1978 Automobile Data)
    
    . drop if missing(rep78)
    (5 observations deleted)
    
    . /* (1) transform the data by subtracting the group specific mean and */
    . /* adding the grand/overall mean back in for outcome and regressors */
    . foreach var of varlist price weight length foreign {
      2.         bys rep78: egen group_mean = mean(`var')
      3.         qui sum `var'
      4.         gen double `var'_star = `var' - group_mean + r(mean)
      5.         drop group_mean
      6. }
    
    . /* (2) Fit the model on transformed data */
    . regress price_star weight_star length_star foreign_star
    
          Source |       SS           df       MS      Number of obs   =        69
    -------------+----------------------------------   F(3, 65)        =     26.99
           Model |   315296838         3   105098946   Prob > F        =    0.0000
        Residual |   253139578        65  3894455.05   R-squared       =    0.5547
    -------------+----------------------------------   Adj R-squared   =    0.5341
           Total |   568436416        68  8359359.06   Root MSE        =    1973.4
    
    ------------------------------------------------------------------------------
      price_star |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
     weight_star |    6.15521   1.008605     6.10   0.000     4.140885    8.169534
     length_star |  -100.9268   33.82508    -2.98   0.004    -168.4801   -33.37341
    foreign_star |   3394.052    782.454     4.34   0.000     1831.383     4956.72
           _cons |   5453.782   3829.487     1.42   0.159    -2194.232     13101.8
    ------------------------------------------------------------------------------
    
    . /* (3) Adjust the SEs for DoF */
    . foreach coef in weight_star length_star foreign_star _cons {
      2.         di "Adjusted SE for `coef': " %9.8gc _se[`coef']*sqrt(65/61)
      3. }
    Adjusted SE for weight_star:  1.041149
    Adjusted SE for length_star:  34.91649
    Adjusted SE for foreign_star:  807.7009
    Adjusted SE for _cons:   3953.05
    
    . /* (4) Make sure areg gives the same output */
    . areg price weight length foreign, absorb(rep78)
    
    Linear regression, absorbing indicators         Number of obs     =         69
                                                    F(   3,     61)   =      25.33
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.5611
                                                    Adj R-squared     =     0.5108
                                                    Root MSE          =  2037.1129
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          weight |    6.15521   1.041149     5.91   0.000     4.073303    8.237116
          length |  -100.9268   34.91649    -2.89   0.005    -170.7466   -31.10692
         foreign |   3394.052   807.7009     4.20   0.000     1778.954    5009.149
           _cons |   5453.782    3953.05     1.38   0.173    -2450.831    13358.39
    -------------+----------------------------------------------------------------
           rep78 |          F(4, 61) =      0.261   0.902           (5 categories)
    

    Stata Code:

    cls
    sysuse auto, clear
    drop if missing(rep78)
    /* (1) transform the data by subtracting the group specific mean and */
    /* adding the grand/overall mean back in for outcome and regressors */
    foreach var of varlist price weight length foreign {
        bys rep78: egen group_mean = mean(`var')
        qui sum `var'
        gen double `var'_star = `var' - group_mean + r(mean)
        drop group_mean
    }
    /* (2) Fit the model on transformed data */
    regress price_star weight_star length_star foreign_star
    /* (3) Adjust the SEs for DoF */
    foreach coef in weight_star length_star foreign_star _cons {
        di "Adjusted SE for `coef': " %9.8gc _se[`coef']*sqrt(65/61)
    }
    /* (4) Make sure areg gives the same output */
    areg price weight length foreign, absorb(rep78)