I have data with variables industry_sales
, year
, and industry
.
I want to conduct time series regressions for each industry in rolling windows of 5 years, with the dependent variable being industry_sales
and the independent variable being year
. For example, if data ranges from years 2000-2010, then the rolling windows for each regression and industry group would be 2000-2004, 2001-2005, and so on.
I had an attempt but the code fails:
library(tidyverse)
library(zoo)
# Group the data by industry
data_grouped <- data %>%
group_by(industry)
glimpse(data_grouped)
# Define a function to run a regression
regression_func <- function(x) {
lm(industry_sales ~ year, data = x)
}
# Perform rolling linear regression on each group, with a window of 5 years
results_list <- data_grouped %>%
do({
rollapply(data_grouped$year, width = 5, FUN = regression_func, by = 1)
})
dput() output of the data is as below, any help would be appreciated:
structure(list(year = c(2010, 2011, 2012, 2013, 2014, 2015, 2016,
2017, 2018, 2019, 2020, 2021, 2022, 2010, 2011, 2012, 2013, 2014,
2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2010, 2011, 2012,
2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2010,
2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021,
2022, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019,
2020, 2021, 2022, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017,
2018, 2019, 2020, 2021, 2022, 2010, 2011, 2012, 2013, 2014, 2015,
2016, 2017, 2018, 2019, 2020, 2021, 2022, 2010, 2011, 2012, 2013,
2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022), industry_sales = c(853.218,
1001.856, 987.741, 990.679, 1061.804, 985.904, 1073.43, 1305.779,
1314.722, 1188.122, 1327.245, 1711.143, 577.358, 25731.969, 26484.263,
23085.259, 24842.509, 30563.672, 34947.361, 40066.425, 38293.972,
40484.096, 37160.342, 29109.755, 35227.521, 16522.915, 1072310.255,
1314543.982, 1288932.701, 1245467.184, 1169743.543, 882325.562,
804108.99, 966832.252, 1099655.803, 999685.99, 725094.688, 797072.967,
59906.738, 131274.107, 167635.266, 170003.001, 161892.01, 149722.289,
108169.702, 92950.89, 120837.727, 132130.143, 121670.84, 118206.596,
147732.623, 542.17, 201842.321, 209721.793, 215722.823, 223158.977,
221191.213, 223076.061, 229199.539, 242995.929, 256145.995, 257282.085,
234455.443, 214560.191, 13964.787, 88190.421, 90191.422, 80848.451,
93999.434, 92325.688, 99138.466, 98682.173, 95106.591, 97622.097,
94254.621, 81145.338, 87421.329, 12116.901, 247825.263, 262355,
235423.37, 236697.987, 245287.766, 222221.931, 219657.82, 222990.939,
224672.689, 162633.722, 139853.429, 148534.588, 4463.853, 28177.129,
29298.618, 29251.077, 31696.968, 31636.033, 33064.57, 35745.408,
32034.079, 31978.646, 31661.88, 28572.857, 34190.976, 3400.497
), industry = c("Agriculture, Forestry and Fishing", "Agriculture, Forestry and Fishing",
"Agriculture, Forestry and Fishing", "Agriculture, Forestry and Fishing",
"Agriculture, Forestry and Fishing", "Agriculture, Forestry and Fishing",
"Agriculture, Forestry and Fishing", "Agriculture, Forestry and Fishing",
"Agriculture, Forestry and Fishing", "Agriculture, Forestry and Fishing",
"Agriculture, Forestry and Fishing", "Agriculture, Forestry and Fishing",
"Agriculture, Forestry and Fishing", "Construction", "Construction",
"Construction", "Construction", "Construction", "Construction",
"Construction", "Construction", "Construction", "Construction",
"Construction", "Construction", "Construction", "Manufacturing",
"Manufacturing", "Manufacturing", "Manufacturing", "Manufacturing",
"Manufacturing", "Manufacturing", "Manufacturing", "Manufacturing",
"Manufacturing", "Manufacturing", "Manufacturing", "Manufacturing",
"Mining", "Mining", "Mining", "Mining", "Mining", "Mining", "Mining",
"Mining", "Mining", "Mining", "Mining", "Mining", "Mining", "Retail Trade",
"Retail Trade", "Retail Trade", "Retail Trade", "Retail Trade",
"Retail Trade", "Retail Trade", "Retail Trade", "Retail Trade",
"Retail Trade", "Retail Trade", "Retail Trade", "Retail Trade",
"Services", "Services", "Services", "Services", "Services", "Services",
"Services", "Services", "Services", "Services", "Services", "Services",
"Services", "Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Transportation, Communications, Electric, Gas and Sanitary service",
"Wholesale Trade", "Wholesale Trade", "Wholesale Trade", "Wholesale Trade",
"Wholesale Trade", "Wholesale Trade", "Wholesale Trade", "Wholesale Trade",
"Wholesale Trade", "Wholesale Trade", "Wholesale Trade", "Wholesale Trade",
"Wholesale Trade")), row.names = c(NA, -104L), class = c("tbl_df",
"tbl", "data.frame"))
There are several problems:
1) We will assume that for each regression we want one row with the industry, last year of the window, intercept, slope and R^2. Modify c(...) to change which statistics you want. For example, c(year = tail(yr, 1), unlist(broom::glance(fm)))
if you wanted a large number of statistics.
This groups by industry
and then uses group_modify
to create the rows for each industry.
library(dplyr, exclude = c("filter", "lag"))
library(zoo)
out <- data %>%
group_by(industry) %>%
group_modify(~ {
.x$year %>%
rollapplyr(5, function(yr) {
fm <- lm(industry_sales ~ year, .x, subset = year %in% yr)
c(year = tail(yr, 1), intercept = coef(fm)[[1]], slope = coef(fm)[[2]],
r.squared = summary(fm)$r.squared)
}) %>%
as.data.frame
}) %>%
ungroup
giving:
> out
# A tibble: 72 × 5
industry year intercept slope r.squared
<chr> <dbl> <dbl> <dbl> <dbl>
1 Agriculture, Forestry and Fishing 2014 -80707. 40.6 0.704
2 Agriculture, Forestry and Fishing 2015 -7481. 4.22 0.0433
3 Agriculture, Forestry and Fishing 2016 -32534. 16.7 0.362
4 Agriculture, Forestry and Fishing 2017 -128244. 64.2 0.605
5 Agriculture, Forestry and Fishing 2018 -165315. 82.6 0.741
6 Agriculture, Forestry and Fishing 2019 -129070. 64.6 0.503
7 Agriculture, Forestry and Fishing 2020 -77455. 39.0 0.317
8 Agriculture, Forestry and Fishing 2021 -164845. 82.3 0.428
9 Agriculture, Forestry and Fishing 2022 193469. -95.2 0.134
10 Construction 2014 -1587815. 802. 0.208
# … with 62 more rows
# ℹ Use `print(n = ...)` to see more rows
2) If you just wanted a single statistic -- the slope, say -- then it might be more convenient to have a 2d zoo series with one column per industry and one row per year. Note that the slope is shift invariant so we can just use 1:5 as the predictor in each case and we will get the same slope as if we had used year.
library(zoo)
z <- read.zoo(data, split = "industry")
out2 <- rollapplyr(z, 5, function(x) coef(lm(x ~ seq(5)))[[2]])
giving:
> out2[1:3, 1:3]
Agriculture, Forestry and Fishing Construction Manufacturing
2014 40.5995 802.1652 12578.98
2015 4.2159 2440.4609 -98362.60
2016 16.6603 4406.7184 -133278.90
If you wanted this as a data frame use fortify.zoo(out2)
or in long form fortify.zoo(out2, melt = TRUE)
.