
Getting a list of column names and coefficients from ols params


I'm using OLS with respect to two dataframes:

gab = ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit() 

where,

only_volume = data_p.iloc[:,0] # Only the first column
all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] # All but the first column

When I extract something, say the parameters or p-values, I get output like this:

In [3]: gab.params
Out[3]: 
Intercept             2.687598e+06
all_but_volume[0]     5.500544e+01
all_but_volume[1]     2.696902e+02
all_but_volume[2]     3.389568e+04
all_but_volume[3]    -2.385838e+04
all_but_volume[4]     5.419860e+02
all_but_volume[5]     3.815161e+02
all_but_volume[6]    -2.281344e+04
all_but_volume[7]     1.794128e+04
...
all_but_volume[22]    1.374321e+00

gab.params returns 23 coefficients besides the intercept, and all_but_volume has 23 columns. Is there a way to get a list or zip of the params labelled with the column names, instead of all_but_volume[i]?

For example:

TMC     9.801195e+01
TAC     2.214464e+02
...

What I've tried: dropping all_but_volume and using data_p.iloc[:, 1:data_p.shape[1]] directly instead.

That didn't work either:

...
data_p.iloc[:, 1:data_p.shape[1]][21]    2.918531e+04
data_p.iloc[:, 1:data_p.shape[1]][22]    1.395342e+00

Edit: Sample Data:

data_p.iloc[1:5,:]
Out[31]: 
          Volume             A              B                  C\
1  569886.171878    759.089217     272.446022           4.163908   
2  561695.886128    701.165406     330.301260           4.136530   
3  627221.486089    377.746089     656.838394           4.130720   
4  625181.750625    361.489041     670.575110           4.134467   

                          D         E        F      G      H     I  \
1                  1.000842  12993.06  3371.28  236.90  4.92  6.13   
2                  0.981514  13005.44  3378.69  236.94  4.92  6.13   
3                  0.836920  13017.22  3384.47  236.98  4.93  6.13   
4                  0.810541  13028.56  3388.85  237.01  4.94  6.13   

                          J               K       L       M           N  \
1      ...                0               0       0        0          0   
2      ...                0               0       0        0          0   
3      ...                0               0       0        0          0   
4      ...                0               0       0        0          0   

           O             P     Q             R   S  
1          0             0     0             1   9202.171648  
2          0             0     0             0   4381.373520  
3          0             0     0             0 -13982.443554  
4          0             0     0             0 -22878.843149

only_volume is the first column, 'Volume'; all_but_volume is every column except 'Volume'.


Solution

  • You can use the DataFrame constructor or rename, because gab.params is a Series:

    Sample:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as sm

    np.random.seed(2018)

    data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume','A','B','C','D'])
    print (data_p)
         Volume         A         B         C         D
    0  0.882349  0.104328  0.907009  0.306399  0.446409
    1  0.589985  0.837111  0.697801  0.802803  0.107215
    2  0.757093  0.999671  0.725931  0.141448  0.356721
    3  0.942704  0.610162  0.227577  0.668732  0.692905
    4  0.416863  0.171810  0.976891  0.330224  0.629044
    5  0.160611  0.089953  0.970822  0.816578  0.571366
    6  0.345853  0.403744  0.137383  0.900934  0.933936
    7  0.047377  0.671507  0.034832  0.252691  0.557125
    8  0.525823  0.352968  0.092983  0.304509  0.862430
    9  0.716937  0.964071  0.539702  0.950540  0.667982
    

    only_volume = data_p.iloc[:,0] # Only the first column
    all_but_volume = data_p.iloc[:, 1:data_p.shape[1]] # All but the first column
    gab = sm.ols(formula= 'only_volume ~ all_but_volume', data=data_p ).fit() 
    print (gab.params)
    Intercept            0.077570
    all_but_volume[0]    0.395072
    all_but_volume[1]    0.313150
    all_but_volume[2]   -0.100752
    all_but_volume[3]    0.247532
    dtype: float64
    
    print (type(gab.params))
    <class 'pandas.core.series.Series'>
    
    df = pd.DataFrame({'cols':data_p.columns[1:], 'par': gab.params.values[1:]})
    print (df)
      cols       par
    0    A  0.395072
    1    B  0.313150
    2    C -0.100752
    3    D  0.247532
    

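    If you want a Series instead (note that this mapping relabels Intercept as Volume, because data_p.columns starts with the response column):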

    s = gab.params.rename(dict(zip(gab.params.index, data_p.columns)))
    print (s)
    Volume    0.077570
    A         0.395072
    B         0.313150
    C        -0.100752
    D         0.247532
    dtype: float64
    

    Series without the first value (the intercept):

    s = gab.params.iloc[1:].rename(dict(zip(gab.params.index, data_p.columns)))
    print (s)
    
    A    0.395072
    B    0.313150
    C   -0.100752
    D    0.247532
    dtype: float64
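
As an alternative to relabelling afterwards, the formula itself can be built from the column names, so that gab.params should come back indexed by those names directly. This is only a sketch, not part of the original answer: it reuses the random sample data from above, the variable names response, predictors and formula are made up for illustration, and it assumes every column name is a valid Python identifier (names containing spaces or other special characters would need wrapping in patsy's Q("...")).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Same random sample data as in the answer above
np.random.seed(2018)
data_p = pd.DataFrame(np.random.rand(10, 5), columns=['Volume', 'A', 'B', 'C', 'D'])

# Hypothetical helper names: first column is the response, the rest are predictors
response = data_p.columns[0]
predictors = data_p.columns[1:]

# Join the predictor names into a formula string: 'Volume ~ A + B + C + D'
formula = response + ' ~ ' + ' + '.join(predictors)

gab = smf.ols(formula=formula, data=data_p).fit()

# The index of gab.params now carries 'Intercept' plus the column names
print(gab.params)
```

With the real data_p from the question, the same join over data_p.columns[1:] would list all 23 predictors, provided their names satisfy the identifier assumption above.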