Predict on test data, using plm package in R, and calculate RMSE for test data

I built a model, using plm package. The sample dataset is here.

I am trying to predict on test data and calculate metrics.

# Import package
library(plm)
library(tidyverse)
library(prediction)
library(nlme)

# Import data 
df <- read_csv('Panel data sample.csv')

# Convert author to character
df$Author <- as.character(df$Author) 

# Split data into train and test
df_train <- df %>% filter(Year != 2020) # 2017, 2018, 2019
df_test <- df %>% filter(Year == 2020) # 2020

# Convert data
panel_df_train <- pdata.frame(df_train, index = c("Author", "Year"), drop.index = TRUE, row.names = TRUE)
panel_df_test <- pdata.frame(df_train, index = c("Author", "Year"), drop.index = TRUE, row.names = TRUE)

# Create the first model
plmFit1 <- plm(Score ~ Articles, data = panel_df_train)

# Print
summary(plmFit1)

# Get the RMSE for train data
sqrt(mean(plmFit1$residuals^2))

# Get the MSE for train data
mean(plmFit1$residuals^2)

Now I am trying to calculate metrics for test data

First, I tried to use prediction() from prediction package, which has an option for plm.

predictions <- prediction(plmFit1, panel_df_test)

Got an error:

Error in crossprod(beta, t(X)) : non-conformable arguments

I read the following questions:

I also read this question, but

fitted <- as.numeric(plmFit1$model[[1]] - plmFit1$residuals) gives me a different number of values from my train or test numbers.

Solution

Regarding out-of-sample prediction with fixed effects models, it is not clear how data relating to fixed effects not in the original model are to be treated, e.g., data for an individual not contained in the orignal data set the model was estimated on. (This is rather a methodological question than a programming question).

The version 2.6-2 of plm allows predict for fixed effect models with the original data and with out-of-sample data (see ?predict.plm).

Find below an example with 10 firms for model estimation and the data to be used for prediction contains a firm not contained in the original data set (besides that firm, there are also years not contained in the original model object but these are irrelevant here as it is a one-way individual model). It is unclear what the fixed effect of that out-of-sample firm would be. Hence, by default, no predicted value is given (NA value). If argument na.fill is set to TRUE, the (weighted) mean of the fixed effects contained in the original model object is used as a best guess.

library(plm)
data("Grunfeld", package = "plm")

# fit a fixed effect model
fit.fe <- plm(inv ~ value + capital, data = Grunfeld, model = "within")

# generate 55 new observations of three firms used for prediction:
#  * firm 1 with years 1935:1964 (has out-of-sample years 1955:1964), 
#  * firm 2 with years 1935:1949 (all in sample),
#  * firm 11 with years 1935:1944 (firm 11 is out-of-sample)
set.seed(42L)

new.value2   <- runif(55, min = min(Grunfeld$value),   max = max(Grunfeld$value))
new.capital2 <- runif(55, min = min(Grunfeld$capital), max = max(Grunfeld$capital))

newdata <- data.frame(firm = c(rep(1, 30), rep(2, 15), rep(11, 10)),
                      year = c(1935:(1935+29), 1935:(1935+14), 1935:(1935+9)),
                      value = new.value2, capital = new.capital2)
# make pdata.frame
newdata.p <- pdata.frame(newdata, index = c("firm", "year"))

## predict from fixed effect model with new data as pdata.frame
predict(fit.fe, newdata = newdata.p) # has NA values for the 11'th firm

## set na.fill = TRUE to have the weighted mean used to for fixed effects -> no NA values
predict(fit.fe, newdata = newdata.p, na.fill = TRUE)

NB: When you input a plain data.frame as newdata, it is not clear how the data related to the individuals and time periods, which is why the weighted mean of fixed effects from the original model object is used for all observations in newdata and a warning is printed. For fixed effect model prediction, it is reasonable to assume the user can provide information (via a pdata.frame) how the data the user wants to use for prediction relates to the individual and time dimension of panel data.