I am estimating a panel model with the package plm.
Some of the individuals in the panel do not have data for all the explanatory variables, so they are excluded from the regression.
How could I see which particular observations have been used for the estimation?
In Stata the usual command is e(sample). What is the equivalent in R?
The plm function returns a list, and one of its elements, named model, holds the data that was actually used for estimation. Here's an example based on the help for plm:
library(plm)
data("Produc")
Let's set the first 20 values of Produc$pcap to NA (missing data):
Produc$pcap[1:20] = NA
Now we'll create a plm model using Produc:
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state","year"))
zz is a list containing the information returned by the plm function. You can run str(zz) to see what zz contains. The data used for the model is stored in zz$model. You can see by the rownames, which start at 21, that the first 20 rows are missing, because those are the ones in which we set Produc$pcap to NA.
head(zz$model) # You can also do: head(zz[["model"]])
   log(gsp) log(pcap)  log(pc) log(emp) unemp
21 10.13634  9.358610 10.21481 6.571583   4.1
22 10.15417  9.403360 10.26915 6.614726   5.6
23 10.12323  9.467233 10.31703 6.591811  12.0
24 10.16743  9.518111 10.28821 6.631606   9.8
25 10.24388  9.559265 10.31137 6.696170   8.2
26 10.34374  9.603196 10.34623 6.797271   6.1
If you want to select the rows of your data frame that were used in the model, you can use the rownames of zz$model as the indices for subsetting:
Produc[rownames(zz$model), ]
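The same row-name idea can be turned into a logical indicator column, which is closer to what e(sample) gives you in Stata. The sketch below uses lm from base R and a toy data frame (both stand-ins, so it runs without plm); the column name in_sample is made up for illustration. It assumes the row names of the stored model frame match the row names of the original data, which is worth checking with head(rownames(zz$model)) before relying on it with a plm object.

```r
# Toy data with missing values in both variables (hypothetical data)
df <- data.frame(
  y = c(1.2, 2.3, NA, 4.1, 5.0, 6.2),
  x = c(0.5, NA, 1.5, 2.0, 2.5, 3.0)
)

# lm keeps the model frame in fit$model; rows with NA are dropped
fit <- lm(y ~ x, data = df)

# Flag the rows of the original data that appear in the model frame
df$in_sample <- rownames(df) %in% rownames(fit$model)
df
```

With a flag like this, df[df$in_sample, ] gives the estimation sample and df[!df$in_sample, ] gives the excluded rows.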
Alternatively, Produc[complete.cases(Produc), ] returns only those rows of the data frame that have no missing data in any column. Note, though, that if your data frame has columns with missing data that were not used in the model formula, this approach will generally exclude some rows that were nevertheless used in the model. The exception is when missing data in the unused columns always occurs in rows that also have missing data in a column used by the model.
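One way around this limitation is to check for missing values only in the variables the model actually uses. The sketch below is a minimal illustration on a toy data frame with hypothetical column names; all.vars() extracts the variable names from the formula, and the index columns are added by hand. It assumes rows are dropped only because of NA in these variables, so transformations in the formula that produce NaN would not be caught.

```r
# Toy panel; 'notes' has missing values but is not part of the model
df <- data.frame(
  id    = rep(1:3, each = 2),
  year  = rep(2001:2002, times = 3),
  y     = c(1.2, 2.3, NA, 4.1, 5.0, 6.2),
  x     = c(0.5, NA, 1.5, 2.0, 2.5, 3.0),
  notes = c(NA, "a", "b", NA, "c", "d")
)

f <- y ~ x

# Variables named in the formula, plus the panel index columns
model_vars <- c(all.vars(f), "id", "year")

# TRUE for rows that are complete in the model variables only
used <- complete.cases(df[, model_vars])

df[used, ]                 # rows complete in the model variables
df[complete.cases(df), ]   # stricter: also drops rows with NA in 'notes'
```

Comparing the two results shows how many rows the unrestricted complete.cases() call would discard unnecessarily.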