
R: adding plm() fitted values back to the data when there are fewer fitted values than observations in the regression


We're running a panel regression with the plm() function from the R package plm and want to add the fitted values as a new column to the dataset on which the regression was estimated.

MP_regression <- plm(operating_exp ~ HHI + rate + rate_lag1 + rate_lag2 +
                       HHI*rate + HHI*rate_lag1 + HHI*rate_lag2,
                     data = market_power_merged, effect = "individual",
                     model = "within", index = c("firm", "date"))

When we use fitted(MP_regression) as such:

fitted_values <- fitted(MP_regression)

then it returns fewer fitted values than there are observations in the input data for the regression. We therefore want to add them back to the market_power_merged data frame by date and firm. Because fitted() returns fewer values, it is important to match on both date and firm, so that we can either see which observations were excluded or remove the observations for which no fitted value is produced.

In essence we want to:

market_power_merged <- mutate(market_power_merged, fitted_values = fitted(MP_regression))

and match them by firm (individual) and date (time).


Solution

  • The vector returned by fitted() carries an index attribute: a data frame holding the panel identifiers (individual and time) of each fitted value. You can therefore cbind this index attribute to the fitted values and then run merge (with all.x=TRUE) or left_join against the original data frame:

    fitted_values_vec <- fitted(MP_regression)
    fitted_values_df <- cbind(attr(fitted_values_vec, "index"), 
                              fitted_values = fitted_values_vec)
    
    market_power_merged <- base::merge(market_power_merged, fitted_values_df,
                                       by = c("firm", "date"), all.x = TRUE)
    # market_power_merged <- dplyr::left_join(market_power_merged, fitted_values_df, by = c("firm", "date"))
    
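One thing to watch with the left_join alternative: the key columns in the index attribute are factors, while the keys in the original data may be integers, dates or character strings, and dplyr may refuse to join keys of incompatible types. A minimal sketch of converting the keys first, using a hypothetical stand-in for the index data frame (the fitted values here are made up for illustration):

```r
# Hypothetical stand-in for the index attribute bound to fitted values:
# both key columns arrive as factors. The numbers are illustrative only.
index_df <- data.frame(state = factor(c("ALABAMA", "ALABAMA", "WYOMING")),
                       year  = factor(c("1970", "1974", "1986")),
                       fitted_values = c(-0.25, -0.09, 0.32))

# Pitfall: as.integer() on a factor returns the internal level codes,
# not the years themselves.
level_codes <- as.integer(index_df$year)

# Going through as.character() first recovers the original values.
index_df$year  <- as.integer(as.character(index_df$year))
index_df$state <- as.character(index_df$state)

str(index_df)
```

Which target type to convert to depends on how the key columns are stored in your own data frame, so check them with str() before joining.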

    To demonstrate with the built-in plm data frame, Produc:

    library(plm)
    data("Produc", package = "plm")
    
    # ASSIGN RANDOM NAs ACROSS NON-PANEL COLUMNS
    set.seed(41120)
    for(col in names(Produc)[!names(Produc) %in% c("state", "year")]) {
      Produc[sample(nrow(Produc), 50), col] <- NA
    }
    
    results <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
                   data = Produc, index = c("state","year"))
    
    fitted_values_vec <- fitted(results)
    str(fitted_values_vec)
    # 'pseries' Named num [1:588] -0.2459 -0.2274 -0.0927 -0.0981 -0.0184 ...
    # - attr(*, "names")= chr [1:588] "ALABAMA" "ALABAMA" "ALABAMA" "ALABAMA" ...
    # - attr(*, "index")=Classes ‘pindex’ and 'data.frame': 588 obs. of  2 variables:
    #   ..$ state: Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
    #   ..$ year : Factor w/ 17 levels "1970","1971",..: 1 2 5 6 7 8 9 10 12 13 ...
    
    
    fitted_values_df <- cbind(attr(fitted_values_vec, "index"), 
                              fitted_values = fitted_values_vec)
    
    Produc <- merge(Produc, fitted_values_df, by= c("state","year"), all.x=TRUE)
    
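As a sanity check on the merge, the sketch below rebuilds the Produc example and confirms two things: every fitted value found a row, and the rows left with NA are exactly those with a missing value in one of the model variables. It assumes the plm package is installed; the names `merged`, `model_vars` and `complete_rows` are introduced here for illustration, and `as.numeric()` is only used to drop the extra attributes from the fitted vector.

```r
library(plm)
data("Produc", package = "plm")

# Same random NAs as in the demonstration
set.seed(41120)
for (col in names(Produc)[!names(Produc) %in% c("state", "year")]) {
  Produc[sample(nrow(Produc), 50), col] <- NA
}

results <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
               data = Produc, index = c("state", "year"))

fitted_values_vec <- fitted(results)
fitted_values_df <- cbind(attr(fitted_values_vec, "index"),
                          fitted_values = as.numeric(fitted_values_vec))

merged <- merge(Produc, fitted_values_df, by = c("state", "year"), all.x = TRUE)

# Rows with no NA in any variable that enters the model formula
model_vars <- c("gsp", "pcap", "pc", "emp", "unemp")
complete_rows <- complete.cases(merged[, model_vars])

# Every fitted value was matched to exactly one row ...
stopifnot(sum(!is.na(merged$fitted_values)) == length(fitted_values_vec))
# ... and fitted values are missing only where a model variable is missing
stopifnot(all(complete_rows == !is.na(merged$fitted_values)))
```

If both checks pass, the gap between the number of observations and the number of fitted values is explained entirely by rows dropped for missing data.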

    Output

    head(Produc,10)
    
    #      state year region     pcap     hwy   water    util       pc   gsp    emp unemp fitted_values
    # 1  ALABAMA 1970      6 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5   4.7   -0.24591969
    # 2  ALABAMA 1971      6 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9   5.2   -0.22735513
    # 3  ALABAMA 1972      6 15972.41 7765.42 1764.75 6442.23       NA 31303 1072.3    NA            NA
    # 4  ALABAMA 1973   <NA>       NA 7907.66 1742.41 6756.19 40084.01 33430 1135.5   3.9            NA
    # 5  ALABAMA 1974      6 16762.67 8025.52      NA 7002.29 42057.31 33749 1169.8   5.5   -0.09272471
    # 6  ALABAMA 1975      6 17316.26 8158.23      NA 7405.76 43971.71 33604 1155.4   7.7   -0.09806212
    # 7  ALABAMA 1976      6 17732.86      NA 1799.74 7704.93 50221.57 35764 1207.0   6.8   -0.01841929
    # 8  ALABAMA 1977      6 18111.93 8365.67 1845.11 7901.15 51084.99 37463 1269.2   7.4    0.02047675
    # 9  ALABAMA 1978      6 18479.74 8510.64 1960.51 8008.59 52604.05 39964 1336.5   6.3    0.07225304
    # 10 ALABAMA 1979      6 18881.49 8640.61 2081.91 8158.97 54525.86 40979 1362.0   7.1    0.09364171
    
    tail(Produc,10)
    
    #       state year region    pcap     hwy  water    util       pc   gsp   emp unemp fitted_values
    # 807 WYOMING 1977      8 4037.03 2898.34 291.64  847.04 19977.67  9779 170.5   3.6     0.0871588
    # 808 WYOMING 1978      8 4115.61 2920.85 294.73  900.04 20760.24 11038 187.4    NA            NA
    # 809 WYOMING 1979      8 4268.71 2950.53 313.47 1004.71 21643.50 11988 200.7   2.8     0.2346269
    # 810 WYOMING 1980      8      NA 2979.23 338.06 1082.40 22628.22 13027 210.2   4.0            NA
    # 811 WYOMING 1981      8 4572.67 3005.62 379.19 1187.86 26330.20 13717 223.5   4.1     0.3704301
    # 812 WYOMING 1982      8 4731.98 3060.64 408.43 1262.90 27724.96 13056 217.7   5.8     0.3595080
    # 813 WYOMING 1983      8 4950.82 3119.98 445.59      NA 28586.46 11922    NA   8.4            NA
    # 814 WYOMING 1984      8 5184.73 3195.68 476.57      NA 28794.80 12073 204.3   6.3     0.3199823
    # 815 WYOMING 1985      8 5448.38 3295.92 523.01 1629.45 29326.94 12022    NA   7.1            NA
    # 816 WYOMING 1986      8 5700.41 3400.96 565.58 1733.88 27110.51    NA 196.3   9.0            NA
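Once the fitted values are merged in, the two things the question asks for (seeing which observations were excluded, or removing them) reduce to filtering on NA in the new column. A minimal base R sketch on a hypothetical miniature of the merged data; the values are made up for illustration:

```r
# Hypothetical miniature of the merged data frame: NA in fitted_values
# marks rows that the regression dropped. Values are illustrative only.
merged <- data.frame(
  state = c("ALABAMA", "ALABAMA", "ALABAMA", "WYOMING", "WYOMING"),
  year  = c(1970L, 1971L, 1972L, 1985L, 1984L),
  unemp = c(4.7, 5.2, NA, 7.1, 6.3),
  fitted_values = c(-0.246, -0.227, NA, NA, 0.320)
)

has_fit <- !is.na(merged$fitted_values)

# Option 1: list the observations that were excluded from the regression
excluded <- merged[!has_fit, c("state", "year")]
print(excluded)

# Option 2: keep only the observations that have a fitted value
used <- merged[has_fit, ]
print(used)
```

If only the second option is needed, calling merge() without all.x=TRUE performs an inner join and drops the unmatched rows directly.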