Tags: python, statistics, regression, statsmodels

Regression coefficients do not match conditional means


You can download the data set from this repo. The first few rows look like this:

          Y  CONST  T  X1  X1T  X2  X2T
0  2.312520      1  1   0    0   1    1
1 -0.836074      1  1   1    1   1    1
2 -0.797183      1  0   0    0   1    0
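For a self-contained example, the three preview rows above can be rebuilt directly as a DataFrame (in practice the full dataset would be read from the linked repo):

```python
import pandas as pd

# The three preview rows shown above, reconstructed column by column
df = pd.DataFrame({
    'Y':     [2.31252, -0.836074, -0.797183],
    'CONST': [1, 1, 1],
    'T':     [1, 1, 0],
    'X1':    [0, 1, 0],
    'X1T':   [0, 1, 0],   # X1 * T
    'X2':    [1, 1, 1],
    'X2T':   [1, 1, 0],   # X2 * T
})
```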

I have a dependent variable (Y) and three binary columns (T, X1 and X2). From this data we can create four groups:

  1. X1 == 0 and X2 == 0
  2. X1 == 0 and X2 == 1
  3. X1 == 1 and X2 == 0
  4. X1 == 1 and X2 == 1

Within each group, I want to calculate the difference in the mean of Y between observations with T == 1 and T == 0.

I can do so with the following code:

# Libraries
import pandas as pd

# Group by T, X1, X2 and get the mean of Y
t = df.groupby(['T','X1','X2'])['Y'].mean().reset_index()

# Reshape the result and rename the columns
t = t.pivot(index=['X1','X2'], columns='T', values='Y')
t.columns = ['Teq0','Teq1']

# I want to replicate these differences with a regression
t['Teq1'] - t['Teq0']

> X1  X2
> 0   0     0.116175
>     1     0.168791
> 1   0    -0.027278
>     1    -0.147601

Problem

I want to recreate these results with the following regression model (m).

# Libraries
from statsmodels.api import OLS

# Fit regression with interaction terms
m = OLS(endog=df['Y'], exog=df[['CONST','T','X1','X1T','X2','X2T']]).fit()

# Estimated values
m.params[['T','X1T','X2T']]

> T      0.162198
> X1T   -0.230372
> X2T   -0.034303

I was expecting the following coefficient sums to match the four group differences:

  1. T = 0.116175
  2. T + X1T = 0.168791
  3. T + X2T = -0.027278
  4. T + X1T + X2T = -0.147601

Question

Why don't the regression coefficients match the results from the first chunk's output (t['Teq1'] - t['Teq0'])?


Solution

  • Thanks to @Josef for noticing that T, X1 and X2 have eight different combinations while my regression model has six parameters. I was therefore missing two interaction terms (and thus two parameters).

    Namely, the regression model needs to account for the interaction between X1 and X2 as well as the interaction between X1, X2 and T.
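The parameter shortfall can be counted directly: three binary columns define 2³ = 8 distinct cells, while the original design matrix has only 6 columns, so the model cannot fit all eight cell means.

```python
from itertools import product

# Three binary regressors (T, X1, X2) define 2**3 = 8 distinct cells
cells = list(product([0, 1], repeat=3))
print(len(cells))  # 8

# ...but the original design matrix has only 6 columns
cols = ['CONST', 'T', 'X1', 'X1T', 'X2', 'X2T']
print(len(cols))  # 6
```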

    This can be done by declaring the missing interaction columns and fitting the model:

    # Declare missing columns
    df = df.assign(X1X2 = df['X1'].multiply(df['X2']),
                   X1X2T = df['X1'].multiply(df['X2T']))
    
    # List of independent variables
    cols = ['CONST','T','X1','X1T','X2','X2T','X1X2','X1X2T']
    
    # Fit model
    m = OLS(endog=df['Y'], exog=df[cols]).fit()
    

    Alternatively, we can use the formula interface:

    # Declare formula
    f = 'Y ~ T + X1 + I(X1*T) + X2 + I(X2*T) + I(X1*X2) + I(X1*X2*T)'
    
    # Fit model
    m = OLS.from_formula(formula=f, data=df).fit()