Weekday as dummy / factor variable in a linear regression model using statsmodels


The question:

How can I add a dummy / factor variable to a model using sm.OLS()?

The details:

Data sample structure:

Date    A   B   weekday
2013-05-04  25.03   88.51   Saturday
2013-05-05  52.98   67.99   Sunday
2013-05-06  39.93   75.19   Monday
2013-05-07  47.31   86.99   Tuesday
2013-05-08  19.61   87.94   Wednesday
2013-05-09  39.51   83.10   Thursday
2013-05-10  21.22   62.16   Friday
2013-05-11  19.04   58.79   Saturday
2013-05-12  18.53   75.27   Sunday
2013-05-13  11.90   75.43   Monday
2013-05-14  47.64   64.76   Tuesday
2013-05-15  27.47   91.65   Wednesday
2013-05-16  11.20   59.83   Thursday
2013-05-17  25.10   67.47   Friday
2013-05-18  19.89   64.70   Saturday
2013-05-19  38.91   76.68   Sunday
2013-05-20  42.11   94.36   Monday
2013-05-21  7.845   73.67   Tuesday
2013-05-22  35.45   76.67   Wednesday
2013-05-23  29.43   79.05   Thursday
2013-05-24  33.51   78.53   Friday
2013-05-25  13.58   59.26   Saturday
2013-05-26  37.38   68.59   Sunday
2013-05-27  37.09   67.79   Monday
2013-05-28  21.70   70.54   Tuesday
2013-05-29  11.85   60.00   Wednesday

The following fits a linear regression of A on B using sm.OLS(), with a constant term added via sm.add_constant().

Complete code with data sample for regression analysis using statsmodels:

# imports
import pandas as pd
import statsmodels.api as sm

# same data as described above
data = {'Date': {0: '2013-05-04',
          1: '2013-05-05',
          2: '2013-05-06',
          3: '2013-05-07',
          4: '2013-05-08',
          5: '2013-05-09',
          6: '2013-05-10',
          7: '2013-05-11',
          8: '2013-05-12',
          9: '2013-05-13',
          10: '2013-05-14',
          11: '2013-05-15',
          12: '2013-05-16',
          13: '2013-05-17',
          14: '2013-05-18',
          15: '2013-05-19',
          16: '2013-05-20',
          17: '2013-05-21',
          18: '2013-05-22',
          19: '2013-05-23',
          20: '2013-05-24',
          21: '2013-05-25',
          22: '2013-05-26',
          23: '2013-05-27',
          24: '2013-05-28',
          25: '2013-05-29'},
         'A': {0: 25.03,
          1: 52.98,
          2: 39.93,
          3: 47.31,
          4: 19.61,
          5: 39.51,
          6: 21.22,
          7: 19.04,
          8: 18.53,
          9: 11.9,
          10: 47.64,
          11: 27.47,
          12: 11.2,
          13: 25.1,
          14: 19.89,
          15: 38.91,
          16: 42.11,
          17: 7.845,
          18: 35.45,
          19: 29.43,
          20: 33.51,
          21: 13.58,
          22: 37.38,
          23: 37.09,
          24: 21.7,
          25: 11.85},
         'B': {0: 88.51,
          1: 67.99,
          2: 75.19,
          3: 86.99,
          4: 87.94,
          5: 83.1,
          6: 62.16,
          7: 58.79,
          8: 75.27,
          9: 75.43,
          10: 64.76,
          11: 91.65,
          12: 59.83,
          13: 67.47,
          14: 64.7,
          15: 76.68,
          16: 94.36,
          17: 73.67,
          18: 76.67,
          19: 79.05,
          20: 78.53,
          21: 59.26,
          22: 68.59,
          23: 67.79,
          24: 70.54,
          25: 60.0},
         'weekday': {0: 'Saturday',
          1: 'Sunday',
          2: 'Monday',
          3: 'Tuesday',
          4: 'Wednesday',
          5: 'Thursday',
          6: 'Friday',
          7: 'Saturday',
          8: 'Sunday',
          9: 'Monday',
          10: 'Tuesday',
          11: 'Wednesday',
          12: 'Thursday',
          13: 'Friday',
          14: 'Saturday',
          15: 'Sunday',
          16: 'Monday',
          17: 'Tuesday',
          18: 'Wednesday',
          19: 'Thursday',
          20: 'Friday',
          21: 'Saturday',
          22: 'Sunday',
          23: 'Monday',
          24: 'Tuesday',
          25: 'Wednesday'}}

df = pd.DataFrame(data)
df = df.set_index(['Date'])

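# keep weekday as a plain object (string) column
df['weekday'] = df['weekday'].astype(object)

# regress A on B; add_constant() adds the intercept column
independent = df['B'].to_frame()
x = sm.add_constant(independent)

model = sm.OLS(df['A'], x).fit()
print(model.summary())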

Output (shortened):

                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const         -1.4328     17.355     -0.083      0.935       -37.252    34.386
B              0.4034      0.233      1.729      0.097        -0.078     0.885
==============================================================================

Now I'd like to add weekday as an explanatory factor variable. I was hoping it would be as easy as changing the data type in the dataframe, but that doesn't work, even though x = sm.add_constant(independent) accepts the column.

import pandas as pd
import statsmodels.api as sm

df = pd.read_clipboard(sep='\\s+')
df = df.set_index(['Date'])

df['weekday'] =  df['weekday'].astype(object)

independent = df[['B', 'weekday']]
x = sm.add_constant(independent)

model = sm.OLS(df['A'], x).fit()
model.summary()

The call model = sm.OLS(df['A'], x).fit() then raises a ValueError:

ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data).

Any other suggestions?
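
The hint in the error message can be followed directly. The snippet below is a minimal sketch on a three-row stand-in for the data above (the stand-in frame and the variable names numeric_only and mixed are assumptions made for illustration). It shows that a selection containing the string column converts to a NumPy array of dtype object, which is what the error message complains about.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the question's frame: one numeric and one string column.
df = pd.DataFrame({
    "B": [88.51, 67.99, 75.19],
    "weekday": ["Saturday", "Sunday", "Monday"],
})

# A purely numeric selection converts to a float array.
numeric_only = np.asarray(df[["B"]])
print(numeric_only.dtype)

# Adding the string column forces the whole array to a generic object dtype,
# so there is no numeric design matrix for OLS to work with.
mixed = np.asarray(df[["B", "weekday"]])
print(mixed.dtype)
```

So the weekday column has to be turned into numeric dummy columns before it reaches sm.OLS().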


Solution

  • You can use a pandas categorical to create the dummy variables or, more simply, use the formula interface, where patsy transforms all non-numeric columns into dummy variables or another factor encoding.

    Using the formula interface here (sm.OLS.from_formula is equivalent to the lowercase ols in statsmodels.formula.api) gives the result below. Patsy sorts the levels of a categorical variable alphabetically, so 'Friday', the first level, is missing from the list of coefficients: it has been selected as the reference category.

    >>> res = sm.OLS.from_formula('A ~ B + weekday', df).fit()
    >>> print(res.summary())
                                OLS Regression Results                            
    ==============================================================================
    Dep. Variable:                      A   R-squared:                       0.301
    Model:                            OLS   Adj. R-squared:                  0.029
    Method:                 Least Squares   F-statistic:                     1.105
    Date:                Thu, 03 May 2018   Prob (F-statistic):              0.401
    Time:                        15:26:02   Log-Likelihood:                -97.898
    No. Observations:                  26   AIC:                             211.8
    Df Residuals:                      18   BIC:                             221.9
    Df Model:                           7                                         
    Covariance Type:            nonrobust                                         
    ========================================================================================
                               coef    std err          t      P>|t|      [0.025      0.975]
    ----------------------------------------------------------------------------------------
    Intercept               -1.4717     19.343     -0.076      0.940     -42.110      39.167
    weekday[T.Monday]        2.5837      9.857      0.262      0.796     -18.124      23.291
    weekday[T.Saturday]     -6.5889      9.599     -0.686      0.501     -26.755      13.577
    weekday[T.Sunday]        9.2287      9.616      0.960      0.350     -10.975      29.432
    weekday[T.Thursday]     -1.7610     10.321     -0.171      0.866     -23.445      19.923
    weekday[T.Tuesday]       2.6507      9.664      0.274      0.787     -17.652      22.953
    weekday[T.Wednesday]    -6.9320      9.911     -0.699      0.493     -27.754      13.890
    B                        0.4047      0.258      1.566      0.135      -0.138       0.948
    ==============================================================================
    Omnibus:                        1.039   Durbin-Watson:                   2.313
    Prob(Omnibus):                  0.595   Jarque-Bera (JB):                0.532
    Skew:                          -0.350   Prob(JB):                        0.766
    Kurtosis:                       3.007   Cond. No.                         638.
    ==============================================================================
    
    Warnings:
    [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
    

    See the patsy documentation for the available categorical encodings: http://patsy.readthedocs.io/en/latest/categorical-coding.html

    For example, the reference category can be set explicitly, as in this formula:

    "A ~ B + C(weekday, Treatment('Sunday'))"
    

    http://patsy.readthedocs.io/en/latest/API-reference.html#patsy.Treatment
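
As a hedged illustration of the explicit reference coding, the sketch below fits that formula through statsmodels.formula.api.ols. It assumes the same 14-row subset of the sample data as a stand-in for the full frame, so the estimates will differ from the full-data summary above.

```python
import pandas as pd
import statsmodels.formula.api as smf

# First two weeks of the sample data from the question.
days = ["Saturday", "Sunday", "Monday", "Tuesday",
        "Wednesday", "Thursday", "Friday"]
df = pd.DataFrame({
    "A": [25.03, 52.98, 39.93, 47.31, 19.61, 39.51, 21.22,
          19.04, 18.53, 11.90, 47.64, 27.47, 11.20, 25.10],
    "B": [88.51, 67.99, 75.19, 86.99, 87.94, 83.10, 62.16,
          58.79, 75.27, 75.43, 64.76, 91.65, 59.83, 67.47],
    "weekday": days * 2,
})

# Treatment('Sunday') makes Sunday the reference level instead of the
# alphabetical default (Friday); all other days get their own coefficient.
res = smf.ols("A ~ B + C(weekday, Treatment('Sunday'))", data=df).fit()

# Coefficient names: Sunday is absent, Friday now appears.
print(res.params.index.tolist())
```

Each weekday coefficient is then read as the difference from Sunday at a given value of B.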