I have this data and I'm trying to do a lagged linear regression in r to determine if the number of YOY's is significantly correlated to numbers of 1 year olds the next year, and 2 year olds the year after that... etc...
data:
structure(list(Year = c("2008", "2009", "2010", "2011", "2012",
"2013", "2014", "2015", "2016", "2017", "2018", "2007", "2007",
"2007", "2007", "2008", "2008", "2008", "2009", "2009", "2009",
"2009", "2009", "2009", "2009", "2010", "2010", "2010", "2010",
"2010", "2011", "2011", "2011", "2011", "2011", "2011", "2011",
"2011", "2011", "2012", "2012", "2012", "2012", "2012", "2012",
"2012", "2012", "2013", "2013", "2013", "2013", "2013", "2013",
"2013", "2013", "2014", "2014", "2014", "2014", "2014", "2014",
"2014", "2014", "2014", "2015", "2015", "2015", "2015", "2015",
"2015", "2015", "2015", "2015", "2016", "2016", "2016", "2016",
"2016", "2016", "2016", "2017", "2017", "2017", "2017", "2017",
"2017", "2017", "2018", "2018", "2018", "2018", "2018", "2018",
"2018", "2018"), Age = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 2L, 3L, 6L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 2L, 3L, 4L, 5L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L,
8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 10L, 2L, 3L, 4L, 5L, 6L, 7L, 10L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), .Label = c("0", "1",
"2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"), n = c(166,
28, 34, 77, 170, 18, 3, 22, 43, 50, 151, 1, 8, 17, 1, 4, 19,
1, 1, 46, 37, 52, 5, 1, 1, 19, 41, 15, 16, 1, 1, 13, 4, 26, 12,
11, 1, 1, 1, 1, 87, 15, 13, 27, 13, 17, 1, 1, 32, 30, 3, 4, 1,
1, 1, 1, 24, 15, 23, 6, 2, 1, 2, 2, 4, 18, 13, 31, 28, 3, 3,
6, 1, 4, 6, 1, 5, 9, 1, 1, 1, 16, 16, 8, 1, 1, 4, 1, 12, 4, 7,
2, 1, 2, 1), id = c("YOY", "YOY", "YOY", "YOY", "YOY", "YOY",
"YOY", "YOY", "YOY", "YOY", "YOY", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult",
"Adult", "Adult", "Adult", "Adult")), row.names = c(NA, -95L), class = "data.frame")
I made great plot that shows it sure does look like there is something here. Not perfect but some kind of relationship.
# Frequencey density plot of ages over year
ggplot(wi.age.count, aes(x=Year, y=Age)) +
geom_point(aes(cex = n, color = id)) +
#scale_fill_brewer(palette="Set1") +
labs(title = "Age frequency plot", subtitle = "Hogfish", y = "Age", x = "Year") +
scale_size(range = c(1,10), breaks=c(1,2, 5, 10, 20, 40, 60, 80, 110, 150)) +
theme_bw() + theme(panel.border = element_blank(), panel.grid.major = element_blank(),
panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))
As a bonus, if anyone knows how to put diagonal lines through to data from age, year to age+1, year+1 etc... that would be great.
My lagged linear code is terrible and I've spent weeks reading literature and stack questions. I can show you more of my failed attempts if you'd like.
here is one attempt
# linear model
l.fit <- lm(wi.age.count$Year ~ wi.age.count$n + lag(wi.age.count$Year, +1)); par(mfrow=c(1,2))
AIC.l.fit <- signif(AIC(l.fit), digits = 3)
plot(wi.age.count$Year ~ wi.age.count$n, pch = 2, type="b", xlab = 'Year', ylab = 'Age Frequency', xlim=range(age.hog$Year), ylim=range(c(0,age.hog$n)), main="Hogfish")
abline(l.fit, lwd=3, lty=3); legend (0, 700, paste("AIC =", AIC.l.fit), bty = 'n')
hist(residuals(l.fit), xlab='Residuals', main='Quality check')
summary(l.fit)
I'm not even really sure which is most appropriate, lagged linear model or an ARIMA or acf()
or something entirely different. One of the issues is that I have 3 dimensions... Year, Age, and number at age. Any help would be greatly appreciated.
Sources I have tried to emulate, on top of all the scientific literature.
R adding lagged variable to sarima model
R - predicting simple dyn model with one lag term
Iteratively forecasting dyn models
Issue when attempting to run a Distributed Lag model in R using dynlm
Comparing linear regressions with a factor and lagged predictors, using R
R: How to fit a time series model such as "Y(t) = αX + βY(t-1)"?
Lagged regression in R: determining the optimal lag
The data should look like this... without the first couple years.
I am adding another answer upon your comment on 7.27.2020. The plot does not have numbers, but it gives some idea about the numbers that I should have in the IVS matrix. Please try the following code and see if it makes sense.
tmp = wi.age.count[order(wi.age.count$Age), ]
ivs = reshape(tmp[which(tmp$Age != 0), -4], direction = "wide", idvar = "Year", timevar = "Age")
ivs[is.na(ivs)] = 0
> ivs
Year n.1 n.2 n.3 n.4 n.5 n.6 n.7 n.8 n.9
13 2007 8 17 0 0 1 0 0 0 0
16 2008 4 19 1 0 0 0 0 0 0
20 2009 46 37 52 5 1 1 0 0 0
26 2010 19 41 15 16 0 0 0 0 1
32 2011 13 4 26 12 11 1 1 1 0
41 2012 87 15 13 27 13 17 1 0 0
49 2013 32 30 3 4 1 1 1 0 0
57 2014 24 15 23 6 2 1 2 2 0
66 2015 18 13 31 28 3 3 6 0 1
74 2016 4 6 1 5 9 1 0 0 1
82 2017 16 16 8 1 1 4 0 0 0
89 2018 12 4 7 2 1 2 1 0 0
This is your ivs matrix. Does that look correct?
Everything else is the same. Here is your dv matrix:
dv = wi.age.count[which(wi.age.count$id == "YOY"), c(1, 3)]
> dv
Year n
1 2008 166
2 2009 28
3 2010 34
4 2011 77
5 2012 170
6 2013 18
7 2014 3
8 2015 22
9 2016 43
10 2017 50
11 2018 151
And your formula with three lags.
formula = ""
for (i in 2:4) formula = paste(formula, "+", names(ivs)[i])
formula = paste("n ~", substr(formula, 4, nchar(formula)))
> formula
[1] "n ~ n.1 + n.2 + n.3"
And here are the results:
l.fit = lm(formula, merge(dv, ivs))
AIC.l.fit <- signif(AIC(l.fit), digits = 3)
summary(l.fit)
Call:
lm(formula = formula, data = merge(dv, ivs))
Residuals:
Min 1Q Median 3Q Max
-60.367 -38.028 8.698 23.763 96.257
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 99.8976 36.1761 2.761 0.028 *
n.1 1.1059 0.8388 1.318 0.229
n.2 -1.7339 1.5773 -1.099 0.308
n.3 -1.6346 1.2932 -1.264 0.247
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 59.48 on 7 degrees of freedom
Multiple R-squared: 0.3731, Adjusted R-squared: 0.1044
F-statistic: 1.389 on 3 and 7 DF, p-value: 0.3233
> AIC.l.fit
[1] 126