Search code examples
rdatasetregressiontibblepanel-data

Regression with Panel Data (Wide)?


I am trying to run the following model of GDP on population:

GDP_(i,t) = alpha + beta*Population_(i,t) + epsilon

Here, each variable is indexed by time (t) and country (i).

I have a panel dataset df1 in a wide format like this:

UK_gdp <- c(4.1, 4.2, 3.8, 4.0)
US_gdp <- c(4.1, 4.2, 3.8, 4.0)
US_pop <- c(220, 230, 240, 260)
UK_pop <- c(40, 45, 47, 49)
year <- c("1965-01-01", "1966-01-01", "1967-01-01", "1968-01-01")
df1 <- tibble(UK_gdp, US_gdp, US_pop, UK_pop, year)

I would like to run the above regression using the columns UK_gdp, US_gdp for the GDP_(i,t) variable, and the columns US_pop, UK_pop as data for the population_(i,t) variable. Is there any way to use the data for both countries in a regression? I do not want to run separate regressions for each country, but instead include all the data in the model when run the regression. I am not sure how to do this.


Solution

  • First, you want to reshape your data into long format.

    > df1$year <- strftime(df1$year, '%Y')  ## this leaves just year from the date
    > df1_l <- reshape(df1, varying=list(c("UK_gdp", "US_gdp"), c("US_pop", "UK_pop")),
    +                  v.names=c('gdp', 'pop'), times=c('UK', 'US'), timevar='country',
    +                  idvar='year', direction='long') |> `rownames<-`(NULL)
    > df1_l
      year country gdp pop
    1 1965      UK 4.1 220
    2 1966      UK 4.2 230
    3 1967      UK 3.8 240
    4 1968      UK 4.0 260
    5 1965      US 4.1  40
    6 1966      US 4.2  45
    7 1967      US 3.8  47
    8 1968      US 4.0  49
    

    The equation you show, is actually just an OLS regression, pooled across all entities and time periods.

    GDP(i,t) = alpha + beta * Population(i,t) + epsilon
    

    > fit1 <- lm(gdp ~ pop, df1_l)
    > summary(fit1)$coefficients
                     Estimate   Std. Error    t value     Pr(>|t|)
    (Intercept)  4.0338312297 0.1068459675 37.7537059 2.305458e-08
    pop         -0.0000624667 0.0006237552 -0.1001462 9.234907e-01
    

    However, a better idea might be to use "country" as a fixed effect, i.e.

    GDP(i,t) = alpha_i + beta * Population(i,t) + epsilon(i,t)
    

    > fit2 <- lfe::felm(gdp ~ pop | country, df1_l)
    > summary(fit2)$coefficients
            Estimate  Std. Error    t value  Pr(>|t|)
    pop -0.005082903 0.005734687 -0.8863435 0.4160214
    

    Since error terms of countries are correlated, you probably should use clustered standard errors,

    > fit3 <- lfe::felm(gdp ~ pop | country | 0 | country, df1_l)
    > summary(fit3)$coefficients
            Estimate Cluster s.e.  t value  Pr(>|t|)
    pop -0.005082903  0.001638335 -3.10248 0.1985035
    

    Finally, there might also be time trends (year effects), which would look like this (not working with this small example dataset):

    GDP(i,t) = alpha_i + gamma_t + beta * Population(i,t) + epsilon(i,t)
    

    > fit4 <- lfe::felm(gdp ~ pop | country + year | 0 | country, df1_l)
    

    Data:

    > dput(df1)
    structure(list(UK_gdp = c(4.1, 4.2, 3.8, 4), US_gdp = c(4.1, 
    4.2, 3.8, 4), US_pop = c(220, 230, 240, 260), UK_pop = c(40, 
    45, 47, 49), year = c("1965-01-01", "1966-01-01", "1967-01-01", 
    "1968-01-01")), class = "data.frame", row.names = c(NA, -4L))