Get data observations used by regression in R (plm)

I am estimating a panel model with the package plm. Some of the individuals in the panel do not have data for all the explanatory variables, so they are excluded from the regression. How could I see which particular observations have been used for the estimation?

In Stata the usual command is e(sample). What is the equivalent in R?

Solution

The object returned by plm is a list, and one of its elements, named model, holds the data that was actually used in the estimation (the rows left after observations with missing values were dropped). Here's an example based on the help for plm:

library(plm)

data("Produc")

Let's set the first 20 values of Produc$pcap to NA (missing data):

Produc$pcap[1:20] <- NA

Now we'll create a plm model using Produc:

zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state","year"))

zz is a list containing the information returned by the plm function. You can run str(zz) to see what it contains. The data used for the model is stored in zz$model. The row names start at 21, which shows that the first 20 rows were dropped: those are the rows in which we set Produc$pcap to NA.

head(zz$model)  # You can also do: head(zz[["model"]])

   log(gsp) log(pcap)  log(pc) log(emp) unemp
21 10.13634  9.358610 10.21481 6.571583   4.1
22 10.15417  9.403360 10.26915 6.614726   5.6
23 10.12323  9.467233 10.31703 6.591811  12.0
24 10.16743  9.518111 10.28821 6.631606   9.8
25 10.24388  9.559265 10.31137 6.696170   8.2
26 10.34374  9.603196 10.34623 6.797271   6.1

If you want to select the rows of your data frame that were used in the model, you can use the rownames of zz$model as the indices for subsetting:

Produc[rownames(zz$model), ]
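If you want something closer to Stata's e(sample), that is, a TRUE/FALSE flag on every row of the original data, the same row names can be turned into an indicator column. This is a minimal sketch, not a built-in plm feature: the column name in_sample is made up for illustration, and it assumes the model frame keeps the row names of the original data frame, as it does in the output shown above.

```r
library(plm)

data("Produc")
Produc$pcap[1:20] <- NA   # same artificial missing values as above

zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"))

# Flag each original row: TRUE if its row name appears in the model data.
# "in_sample" is a hypothetical column name chosen for this example.
Produc$in_sample <- rownames(Produc) %in% rownames(zz$model)

# Count of rows used vs. dropped
table(Produc$in_sample)

# Sanity check: the number of flagged rows should match the model data
sum(Produc$in_sample) == nrow(zz$model)
```

With the flag in place, Produc[!Produc$in_sample, ] lists the observations that were left out of the estimation.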

Alternatively, Produc[complete.cases(Produc), ] returns only the rows of the data frame that have no missing data in any column. Note, though, that this checks every column, not just the ones in the model formula. If a column that is not used in the model has missing values, this approach will, in general, exclude rows that were nevertheless used in the model. The only exception is when missing data in the unused columns always occurs in the same rows as missing data in the columns that are used.
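One way around that caveat is to run complete.cases only on the columns the model actually needs. The sketch below is one possible approach under a few assumptions: the column extra is a hypothetical unused variable added just to show the problem, and the check looks at the raw columns, so it would not catch rows that become NaN or infinite only after a transformation such as log().

```r
library(plm)   # only needed here for the Produc data set

data("Produc")
Produc$pcap[1:20] <- NA    # missing values in a variable used by the model
Produc$extra <- NA_real_   # hypothetical column that the model does not use

fm <- log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp

# Names of the raw variables that appear in the formula
model_vars <- all.vars(fm)

# Check completeness on the model variables and the panel index only
keep <- complete.cases(Produc[, c("state", "year", model_vars)])
used <- Produc[keep, ]

# For comparison: checking all columns also drops rows because of "extra"
sum(complete.cases(Produc))
sum(keep)
```

Here complete.cases(Produc) rejects every row because extra is always missing, while the restricted check only drops the rows where a model variable is missing.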