Regression with Panel Data (Wide)?

I am trying to run the following model of GDP on population:

GDP_(i,t) = alpha + beta*Population_(i,t) + epsilon

Here, each variable is indexed by time (t) and country (i).

I have a panel dataset df1 in a wide format like this:

UK_gdp <- c(4.1, 4.2, 3.8, 4.0)
US_gdp <- c(4.1, 4.2, 3.8, 4.0)
US_pop <- c(220, 230, 240, 260)
UK_pop <- c(40, 45, 47, 49)
year <- c("1965-01-01", "1966-01-01", "1967-01-01", "1968-01-01")
df1 <- tibble(UK_gdp, US_gdp, US_pop, UK_pop, year)

I would like to run the above regression using the columns UK_gdp, US_gdp for the GDP_(i,t) variable, and the columns US_pop, UK_pop as data for the population_(i,t) variable. Is there any way to use the data for both countries in a regression? I do not want to run separate regressions for each country, but instead include all the data in the model when run the regression. I am not sure how to do this.

Solution

First, you want to reshape your data into long format.

> df1$year <- strftime(df1$year, '%Y')  ## this leaves just year from the date
> df1_l <- reshape(df1, varying=list(c("UK_gdp", "US_gdp"), c("US_pop", "UK_pop")),
+                  v.names=c('gdp', 'pop'), times=c('UK', 'US'), timevar='country',
+                  idvar='year', direction='long') |> `rownames<-`(NULL)
> df1_l
  year country gdp pop
1 1965      UK 4.1 220
2 1966      UK 4.2 230
3 1967      UK 3.8 240
4 1968      UK 4.0 260
5 1965      US 4.1  40
6 1966      US 4.2  45
7 1967      US 3.8  47
8 1968      US 4.0  49

The equation you show, is actually just an OLS regression, pooled across all entities and time periods.

GDP(i,t) = alpha + beta * Population(i,t) + epsilon

> fit1 <- lm(gdp ~ pop, df1_l)
> summary(fit1)$coefficients
                 Estimate   Std. Error    t value     Pr(>|t|)
(Intercept)  4.0338312297 0.1068459675 37.7537059 2.305458e-08
pop         -0.0000624667 0.0006237552 -0.1001462 9.234907e-01

However, a better idea might be to use "country" as a fixed effect, i.e.

GDP(i,t) = alpha_i + beta * Population(i,t) + epsilon(i,t)

> fit2 <- lfe::felm(gdp ~ pop | country, df1_l)
> summary(fit2)$coefficients
        Estimate  Std. Error    t value  Pr(>|t|)
pop -0.005082903 0.005734687 -0.8863435 0.4160214

Since error terms of countries are correlated, you probably should use clustered standard errors,

> fit3 <- lfe::felm(gdp ~ pop | country | 0 | country, df1_l)
> summary(fit3)$coefficients
        Estimate Cluster s.e.  t value  Pr(>|t|)
pop -0.005082903  0.001638335 -3.10248 0.1985035

Finally, there might also be time trends (year effects), which would look like this (not working with this small example dataset):

GDP(i,t) = alpha_i + gamma_t + beta * Population(i,t) + epsilon(i,t)

> fit4 <- lfe::felm(gdp ~ pop | country + year | 0 | country, df1_l)

Data:

> dput(df1)
structure(list(UK_gdp = c(4.1, 4.2, 3.8, 4), US_gdp = c(4.1, 
4.2, 3.8, 4), US_pop = c(220, 230, 240, 260), UK_pop = c(40, 
45, 47, 49), year = c("1965-01-01", "1966-01-01", "1967-01-01", 
"1968-01-01")), class = "data.frame", row.names = c(NA, -4L))