Why does plm not like my dplyr-created dataframe?


If I apply simple, seemingly identical operations to two pdata.frames, once with base R and once with dplyr, and then fit both with lm(), I get exactly the same results, as expected. If I pass the same datasets to plm(), however, the estimated model parameters (and the panel structure) differ between the two. Why does this happen?

The toy example below illustrates the issue. Two panel dataframes, df_base and df_dplyr, are generated from a single source, df. Passed through lm(), both dataframes yield the same result. Passed through plm(), however, the panel structure appears to differ (see the counts of n and T in the summaries), which leads to different estimates.

Using R 4.2.3 with dplyr 1.1.1.

set.seed(1)

library(dplyr)
library(magrittr)
library(plm)

# Make toy dataframe
A <- runif(100)
B <- runif(100)
C <- runif(100)
df <- data.frame(A,B,C)
df$id <- floor((as.numeric(rownames(df))-1)/10)
df$t <- ave(df$A, df$id, FUN = seq_along)

# Modify first copy of dataframe using base R
df_base <- pdata.frame(df, index = c('id','t')) 
df_base <- subset(df_base, (as.numeric(df_base$t)<8))

# Modify second copy of dataframe using dplyr
df_dplyr <- pdata.frame(df, index = c('id','t')) 
df_dplyr <- df_dplyr %>% 
  filter(as.numeric(t)<8) 

# Results are the same for lm()
print(summary(lm(A ~ B + C, data = df_base)))
print(summary(lm(A ~ B + C, data = df_dplyr)))

# Results differ for plm()
# Note: plm() selects the estimator via `model`, not `method`
print(summary(plm(A ~ B + C, data = df_base, model = "within")))
print(summary(plm(A ~ B + C, data = df_dplyr, model = "within")))

Solution

  • dplyr is not "pdata.frame-friendly". A pdata.frame carries an index attribute that enables panel operations, and when rows are subset, the index must be subset to match. dplyr does not do this, so filter() leaves the original 100-row index attached to the 70 remaining rows.

    You can see that by:

    nrow(df_dplyr) # 70
    nrow(index(df_dplyr)) # 100
    
    nrow(df_base) # 70
    nrow(index(df_base)) # 70
    

    To fix the mismatched index, rebuild the pdata.frame from a plain data frame:

    df_dplyr_fixed <- pdata.frame(as.data.frame(df_dplyr), index = c("id", "t"))
    print(summary(plm(A ~ B + C, data = df_dplyr_fixed)))
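
An alternative that avoids the mismatch in the first place is to do all dplyr work on the plain data frame and convert to a pdata.frame only as the last step. The sketch below rebuilds the toy data from the question; the names df_filtered, df_panel and fit are illustrative, not part of the original code, and this is one possible workflow rather than the only fix.

```r
library(dplyr)
library(plm)

set.seed(1)

# Rebuild the toy data from the question as a plain data.frame
df <- data.frame(A = runif(100), B = runif(100), C = runif(100))
df$id <- (seq_len(nrow(df)) - 1) %/% 10      # 10 individuals, ids 0..9
df$t  <- ave(df$A, df$id, FUN = seq_along)   # time periods 1..10 within each id

# Filter while the data is still a plain data.frame (no index attribute yet)
df_filtered <- df %>%
  filter(t < 8)

# Attach the panel index only after all row operations are done
df_panel <- pdata.frame(df_filtered, index = c("id", "t"))

# The data and its index now have the same number of rows
nrow(df_panel)         # 70
nrow(index(df_panel))  # 70

fit <- plm(A ~ B + C, data = df_panel, model = "within")
print(summary(fit))
```

Because the index is created from the already-filtered rows, there is nothing for dplyr to leave out of sync.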