Search code examples
rregressionlaglm

Lagged linear model to identify correlation in age frequency data


I have this data and I'm trying to do a lagged linear regression in r to determine if the number of YOY's is significantly correlated to numbers of 1 year olds the next year, and 2 year olds the year after that... etc...

data:

structure(list(Year = c("2008", "2009", "2010", "2011", "2012", 
"2013", "2014", "2015", "2016", "2017", "2018", "2007", "2007", 
"2007", "2007", "2008", "2008", "2008", "2009", "2009", "2009", 
"2009", "2009", "2009", "2009", "2010", "2010", "2010", "2010", 
"2010", "2011", "2011", "2011", "2011", "2011", "2011", "2011", 
"2011", "2011", "2012", "2012", "2012", "2012", "2012", "2012", 
"2012", "2012", "2013", "2013", "2013", "2013", "2013", "2013", 
"2013", "2013", "2014", "2014", "2014", "2014", "2014", "2014", 
"2014", "2014", "2014", "2015", "2015", "2015", "2015", "2015", 
"2015", "2015", "2015", "2015", "2016", "2016", "2016", "2016", 
"2016", "2016", "2016", "2017", "2017", "2017", "2017", "2017", 
"2017", "2017", "2018", "2018", "2018", "2018", "2018", "2018", 
"2018", "2018"), Age = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 2L, 3L, 6L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 2L, 3L, 4L, 5L, 10L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 
9L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 
8L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 5L, 6L, 
7L, 8L, 10L, 2L, 3L, 4L, 5L, 6L, 7L, 10L, 1L, 2L, 3L, 4L, 5L, 
6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L), .Label = c("0", "1", 
"2", "3", "4", "5", "6", "7", "8", "9"), class = "factor"), n = c(166, 
28, 34, 77, 170, 18, 3, 22, 43, 50, 151, 1, 8, 17, 1, 4, 19, 
1, 1, 46, 37, 52, 5, 1, 1, 19, 41, 15, 16, 1, 1, 13, 4, 26, 12, 
11, 1, 1, 1, 1, 87, 15, 13, 27, 13, 17, 1, 1, 32, 30, 3, 4, 1, 
1, 1, 1, 24, 15, 23, 6, 2, 1, 2, 2, 4, 18, 13, 31, 28, 3, 3, 
6, 1, 4, 6, 1, 5, 9, 1, 1, 1, 16, 16, 8, 1, 1, 4, 1, 12, 4, 7, 
2, 1, 2, 1), id = c("YOY", "YOY", "YOY", "YOY", "YOY", "YOY", 
"YOY", "YOY", "YOY", "YOY", "YOY", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult", "Adult", "Adult", "Adult", 
"Adult", "Adult", "Adult", "Adult")), row.names = c(NA, -95L), class = "data.frame")

I made great plot that shows it sure does look like there is something here. Not perfect but some kind of relationship.

# Frequencey density plot of ages over year
ggplot(wi.age.count, aes(x=Year, y=Age)) + 
  geom_point(aes(cex = n, color = id)) +
  #scale_fill_brewer(palette="Set1") +
  labs(title = "Age frequency plot", subtitle = "Hogfish", y = "Age", x = "Year") +
  scale_size(range = c(1,10), breaks=c(1,2, 5, 10, 20, 40, 60, 80, 110, 150)) +
  theme_bw() + theme(panel.border = element_blank(), panel.grid.major = element_blank(),
                     panel.grid.minor = element_blank(), axis.line = element_line(colour = "black"))

As a bonus, if anyone knows how to put diagonal lines through to data from age, year to age+1, year+1 etc... that would be great.

My lagged linear code is terrible and I've spent weeks reading literature and stack questions. I can show you more of my failed attempts if you'd like.

here is one attempt

# linear model
l.fit <- lm(wi.age.count$Year ~ wi.age.count$n + lag(wi.age.count$Year, +1)); par(mfrow=c(1,2)) 
AIC.l.fit <- signif(AIC(l.fit), digits = 3)
plot(wi.age.count$Year ~ wi.age.count$n, pch = 2, type="b", xlab = 'Year', ylab = 'Age Frequency', xlim=range(age.hog$Year), ylim=range(c(0,age.hog$n)), main="Hogfish")
abline(l.fit, lwd=3, lty=3); legend (0, 700, paste("AIC =", AIC.l.fit), bty = 'n')
hist(residuals(l.fit), xlab='Residuals', main='Quality check')
summary(l.fit)

I'm not even really sure which is most appropriate, lagged linear model or an ARIMA or acf() or something entirely different. One of the issues is that I have 3 dimensions... Year, Age, and number at age. Any help would be greatly appreciated.

Sources I have tried to emulate, on top of all the scientific literature.

R adding lagged variable to sarima model

R - predicting simple dyn model with one lag term

Iteratively forecasting dyn models

Issue when attempting to run a Distributed Lag model in R using dynlm

Comparing linear regressions with a factor and lagged predictors, using R

R: How to fit a time series model such as "Y(t) = αX + βY(t-1)"?

Lagged regression in R: determining the optimal lag

The data should look like this... without the first couple years. enter image description here


Solution

  • I am adding another answer upon your comment on 7.27.2020. The plot does not have numbers, but it gives some idea about the numbers that I should have in the IVS matrix. Please try the following code and see if it makes sense.

    tmp = wi.age.count[order(wi.age.count$Age), ]
    ivs = reshape(tmp[which(tmp$Age != 0), -4], direction = "wide", idvar = "Year", timevar = "Age")
    ivs[is.na(ivs)] = 0
    
    > ivs
       Year n.1 n.2 n.3 n.4 n.5 n.6 n.7 n.8 n.9
    13 2007   8  17   0   0   1   0   0   0   0
    16 2008   4  19   1   0   0   0   0   0   0
    20 2009  46  37  52   5   1   1   0   0   0
    26 2010  19  41  15  16   0   0   0   0   1
    32 2011  13   4  26  12  11   1   1   1   0
    41 2012  87  15  13  27  13  17   1   0   0
    49 2013  32  30   3   4   1   1   1   0   0
    57 2014  24  15  23   6   2   1   2   2   0
    66 2015  18  13  31  28   3   3   6   0   1
    74 2016   4   6   1   5   9   1   0   0   1
    82 2017  16  16   8   1   1   4   0   0   0
    89 2018  12   4   7   2   1   2   1   0   0
    

    This is your ivs matrix. Does that look correct?

    Everything else is the same. Here is your dv matrix:

    dv = wi.age.count[which(wi.age.count$id == "YOY"), c(1, 3)]
    
    > dv
       Year   n
    1  2008 166
    2  2009  28
    3  2010  34
    4  2011  77
    5  2012 170
    6  2013  18
    7  2014   3
    8  2015  22
    9  2016  43
    10 2017  50
    11 2018 151
    

    And your formula with three lags.

    formula = ""
    for (i in 2:4) formula = paste(formula, "+",  names(ivs)[i])
    formula = paste("n ~", substr(formula, 4, nchar(formula)))
    
    > formula
    [1] "n ~ n.1 + n.2 + n.3"
    

    And here are the results:

    l.fit = lm(formula, merge(dv, ivs))
    AIC.l.fit <- signif(AIC(l.fit), digits = 3)
    summary(l.fit)
    
    Call:
    lm(formula = formula, data = merge(dv, ivs))
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -60.367 -38.028   8.698  23.763  96.257 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)  
    (Intercept)  99.8976    36.1761   2.761    0.028 *
    n.1           1.1059     0.8388   1.318    0.229  
    n.2          -1.7339     1.5773  -1.099    0.308  
    n.3          -1.6346     1.2932  -1.264    0.247  
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 59.48 on 7 degrees of freedom
    Multiple R-squared:  0.3731,    Adjusted R-squared:  0.1044 
    F-statistic: 1.389 on 3 and 7 DF,  p-value: 0.3233
    
    > AIC.l.fit
    [1] 126