python, python-3.x, latex, linear-regression, statsmodels

Why is `summary_col` ignoring the `info_dict` parameter?


I need to run some linear regressions and output LaTeX code with statsmodels in Python. I am using the summary_col function to achieve that. However, there is either a bug or a misunderstanding on my side. Please see the following code:

import numpy as np 
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

np.random.seed(123)

nsample = 100
x = np.linspace(0, 10, 100)
X = np.column_stack((x, x ** 2))
beta = np.array([1, 0.1, 10])
e = np.random.normal(size=nsample)

X = sm.add_constant(X)
y1 = np.dot(X, beta) + e
y2 = np.dot(X, beta) + 2 * e

model1 = sm.OLS(y1, X).fit()
model2 = sm.OLS(y2, X).fit()

Now, to have a table with the two models side by side:

out_table = summary_col(
    [model1, model2],
    stars=True, 
    float_format='%.2f',
    info_dict={
        'N':lambda x: "{0:d}".format(int(x.nobs)),
        'R2':lambda x: "{:.2f}".format(x.rsquared)
    }
)

I would therefore expect the table to report only the number of observations and $R^2$, since I pass the info_dict argument explicitly. However, the result I get is the following:

==============================
                 y I     y II 
------------------------------
const          0.81**  0.63   
               (0.34)  (0.67) 
x1             0.22    0.35   
               (0.16)  (0.31) 
x2             9.99*** 9.98***
               (0.02)  (0.03) 
R-squared      1.00    1.00   
R-squared Adj. 1.00    1.00   
N              100     100    
R2             1.00    1.00   
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01

Notice the two extra rows, R-squared and R-squared Adj. My desired output would be:

==============================
                 y I     y II 
------------------------------
const          0.81**  0.63   
               (0.34)  (0.67) 
x1             0.22    0.35   
               (0.16)  (0.31) 
x2             9.99*** 9.98***
               (0.02)  (0.03) 
N              100     100    
R2             1.00    1.00   
==============================
Standard errors in
parentheses.
* p<.1, ** p<.05, ***p<.01

The documentation is not the best yet: https://tedboy.github.io/statsmodels_doc/generated/statsmodels.iolib.summary2.summary_col.html

Any ideas on how to display only the information requested by the info_dict argument?


Solution

  • Let's have a look at the source code at

    https://github.com/statsmodels/statsmodels/blob/main/statsmodels/iolib/summary2.py

    We can see that the function summary_col takes info_dict as an argument and uses it in the following way

    if info_dict:
        cols = [_col_info(x, info_dict.get(x.model.__class__.__name__, info_dict)) 
            for x in results]
    

    Here x.model.__class__.__name__ is 'OLS', and since info_dict has no 'OLS' key, the .get call falls back to info_dict itself. In this case, that means _col_info(model1, info_dict) and _col_info(model2, info_dict) are called to generate your N and R2 rows. The lack of type hints and comments makes these functions rather obscure.
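
To make that step concrete, here is a rough sketch of what it amounts to. It is a simplified stand-in, not the library's actual implementation: col_info_sketch and fake_result are hypothetical names, and fake_result only mimics the two attributes that the lambdas read. The point is that every entry of info_dict produces one extra row, so info_dict can only add rows and never removes any.

```python
from types import SimpleNamespace

import pandas as pd


def col_info_sketch(result, info_dict):
    """Simplified, hypothetical stand-in for the private _col_info helper."""
    labels, values = [], []
    for label, func in info_dict.items():
        labels.append(label)
        values.append(func(result))  # each entry is called with the fitted result
    # One column per model, one row per info_dict entry.
    return pd.DataFrame({"y": values}, index=labels)


# Hypothetical stand-in for a fitted result; only the attributes used below exist.
fake_result = SimpleNamespace(nobs=100.0, rsquared=0.9987)

info_dict = {
    "N": lambda x: "{0:d}".format(int(x.nobs)),
    "R2": lambda x: "{:.2f}".format(x.rsquared),
}

info_rows = col_info_sketch(fake_result, info_dict)
print(info_rows)
```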

    Later on, the cols list is appended to the variable summ, which becomes part of a Summary object:

    smry = Summary()
    smry._merge_latex = True
    smry.add_df(summ, header=True, align='l')
    

    However, cols is actually reassigned at that point; it was defined earlier as

    cols = [_col_params(x, stars=stars, float_format=float_format) for x in results]
    

    and that constituted the first part of summ.

    The issue is that _col_params adds the R-squared and R-squared Adj. rows whether you like it or not. Here is the relevant source code:

    rsquared = getattr(result, 'rsquared', np.nan)
    rsquared_adj = getattr(result, 'rsquared_adj', np.nan)
    r2 = pd.Series({('R-squared', ""): rsquared,
                    ('R-squared Adj.', ""): rsquared_adj})
    
    if r2.notnull().any():
        r2 = r2.apply(lambda x: float_format % x)
        res = pd.concat([res, r2], axis=0)
    res = pd.DataFrame(res)
    res.columns = [str(result.model.endog_names)]
    return res
    

    So what I would suggest is to manually modify the tables attribute of your output, keeping the six coefficient and standard-error rows plus the N and R2 rows:

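    # Keep rows 0-5 (coefficients and standard errors) and rows 8-9 (N, R2);
    # rows 6-7 are the unwanted R-squared and R-squared Adj. rows.
    keep = list(range(6)) + [8, 9]
    out_table.tables = [t.iloc[keep, :] for t in out_table.tables]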
    

    After that I get

    In [53]: out_table
    Out[53]: 
    <class 'statsmodels.iolib.summary2.Summary'>
    """
    
    =====================
            y I     y II 
    ---------------------
    const 0.81**  0.63   
          (0.34)  (0.67) 
    x1    0.22    0.35   
          (0.16)  (0.31) 
    x2    9.99*** 9.98***
          (0.02)  (0.03) 
    N     100     100    
    R2    1.00    1.00   
    =====================
    Standard errors in
    parentheses.
    * p<.1, ** p<.05,
    ***p<.01
    """
    

    which should be what you wanted to get.