I am doing a fixed effect model using PanelOLS in Python, and I have use plm in R to validate my result, to my surprise, the coefficient and P values are different between these two, even though they are supposed to be the same?
Data is from R’s dataset
library(AER)
data(Fatalities)
# define the fatality rate
Fatalities$fatal_rate <- Fatalities$fatal / Fatalities$pop * 10000
# mandadory jail or community service
Fatalities$punish <- with(Fatalities, factor(jail == "yes" | service == "yes",
labels=c("no", "yes")))
Below is my Python code for PanelOLS
w1=Fatalities.set_index(["state", "year"])
mod = PanelOLS(w1['fatal_rate'], w1[['beertax','drinkage','punish', 'miles' ,
'unemp','income']], entity_effects=True)
result=mod.fit(cov_type='clustered', cluster_entity=True)
display(result.summary)
Below is my R code for plm
fatalities_mod6 <- plm(fatal_rate ~ beertax + year + drinkage + punish + miles +
unemp + log(income), index=c("state", "year"),
model="within", effect="twoways", data=Fatalities)
I also would like to ask about the cov_type
in PanelOLS
, as I understood, if I would like to have robust standard error and P value, I should use cov_type=’robust’
, instead of ‘clustered’
. But I see many fixed effect examples using ‘clustered’
—which one should I use to get the correct standard error and p-values for the variables?
panelOLS
are:plm
:Two issues, 1. you're using year
variable in the plm
formula which is redundant because it's already index
ed, and 2. your Python PanelOLS
code calculates individual fixed effects so far, I can replicate the Python estimates with plm
using effect="individual"
.
library(plm)
fatalities_mod6 <- plm(fatal_rate ~ beertax + drinkage + punish + miles +
unemp + log(income), index=c("state", "year"),
model="within", effect="individual", data=Fatalities)
Furthermore Python's PanelOLS
appears to use standard errors clustered on state applying the Arellano method using heteroscedasticity-consistent standard errors type 1 ("HC1"
).
round(summary(fatalities_mod6,
vcov=vcovHC.plm(fatalities_mod6, cluster="group", type="HC1",
method="arellano"))$coe, 4)
# Estimate Std. Error t-value Pr(>|t|)
# beertax -0.3664 0.2920 -1.2550 0.2105
# drinkage -0.0378 0.0252 -1.4969 0.1355
# punishyes -0.0342 0.0951 -0.3598 0.7193
# miles 0.0000 0.0000 -0.4217 0.6736
# unemp -0.0196 0.0128 -1.5385 0.1250
# log(income) 0.6765 0.5424 1.2472 0.2134
This resembles the PanelOLS
result of Python.
For the two-way fixed effects estimator of your data with cluster-robust standard errors, the code would be,
for Python:
mod = PanelOLS(w1['fatal_rate'], w1[['beertax','drinkage','punish', 'miles' ,
'unemp','income']],
entity_effects=True, time_effects=True)
and for R:
fatalities_mod6 <- plm(fatal_rate ~ beertax + drinkage + punish + miles +
unemp + log(income), index=c("state", "year"),
model="within", effect="twoways", data=Fatalities)
summary(fatalities_mod6,
vcov=vcovHC.plm(fatalities_mod6, cluster="group", type="HC1"))
Here follow comparisons of Python, R, and Stata, conducted with the grunfeld
data that come with statsmodels
in Python (they're slightly different from the data(Grunfeld, package="plm")
).
PanelOLS
(Python), plm
(R), and xtreg
with vce(cluster *clustvar*)
(Stata) appear to apply slightly different methods to calculate cluster-robust standard errors (see linked docs for details).
Python:
from statsmodels.datasets import grunfeld
data = grunfeld.load_pandas().data
data = data.set_index(['firm','year'])
import pandas as pd
data.to_csv("grunfeld.csv") ## export data for R / Stata
from linearmodels import PanelOLS
mod = PanelOLS.from_formula('invest ~ value + capital + EntityEffects + TimeEffects',
data=data)
print(mod.fit())
# Parameter Std. Err. T-stat P-value Lower CI Upper CI
# ------------------------------------------------------------------------------
# value 0.1167 0.0129 9.0219 0.0000 0.0912 0.1422
# capital 0.3514 0.0210 16.696 0.0000 0.3099 0.3930
print(mod.fit(cov_type='robust'))
# value 0.1167 0.0191 6.1087 0.0000 0.0790 0.1544
# capital 0.3514 0.0529 6.6472 0.0000 0.2471 0.4557
print(mod.fit(cov_type='clustered', cluster_entity=True))
# value 0.1167 0.0113 10.368 0.0000 0.0945 0.1389
# capital 0.3514 0.0470 7.4836 0.0000 0.2588 0.4441
print(mod.fit(cov_type='clustered', cluster_entity=True, cluster_time=True))
# value 0.1167 0.0117 10.015 0.0000 0.0937 0.1397
# capital 0.3514 0.0447 7.8622 0.0000 0.2633 0.4396
R:
grunfeld <- read.csv("V:/Python/fatalities/grunfeld.csv")
library(plm)
fit <- plm(invest ~ value + capital, grunfeld, effect="twoways", model="within",
index=c("firm", "year"))
## resembles exactly py mod.fit()
round(summary(fit)$coe, 4)
# Estimate Std. Error t-value Pr(>|t|)
# value 0.1167 0.0129 9.0219 0
# capital 0.3514 0.0210 16.6964 0
## resembles approximately py mod.fit('cluster', cluster_entity=True) and Stata (below)
round(summary(fit, cluster="group",
vcov=vcovHC.plm(fit, method="arellano", type="HC2")
)$coe, 4)
# Estimate Std. Error t-value Pr(>|t|)
# value 0.1167 0.0114 10.2042 0
# capital 0.3514 0.0507 6.9364 0
## doesn't seem to change with two-way clustering
round(summary(fit, cluster=c("group", "time"),
vcov=vcovHC.plm(fit, method="arellano", type="HC2")
)$coe, 4)
# Estimate Std. Error t-value Pr(>|t|)
# value 0.1167 0.0114 10.2042 0
# capital 0.3514 0.0507 6.9364 0
fit.1 <- lm(invest ~ value + capital + factor(firm) + factor(year) + 0, grunfeld)
library(lmtest)
## resembles exactly py mod.fit(cov_type='robust')
round(coeftest(fit.1, vcov.=sandwich::vcovHC(fit.1, type="HC1"))[1:2,], 4)
# Estimate Std. Error t value Pr(>|t|)
# value 0.1167 0.0191 6.1087 0
# capital 0.3514 0.0529 6.6472 0
Stata:
import delim grunfeld.csv, clear
egen firmn = group(firm)
xtset firmn year
xtreg invest value capital i.year, fe vce(cluster firmn)
# | Robust
# invest | Coef. Std. Err. t P>|t| [95% Conf. Interval]
# -------------+----------------------------------------------------------------
# value | .1166811 .0114755 10.17 0.000 .0911122 .1422501
# capital | .3514357 .0478837 7.34 0.000 .2447441 .4581273