Search code examples
rdataframedplyrlag

Pulling lagged data but only for a particular season in R


I have a particular data set that includes two variables. One is numeric and the other is a character that identifies the season and year that the numeric data came from. This is what the head of the data looks like:

   SeasonYear  mean
   <chr>      <dbl>
 1 winter2000 0.957
 2 spring2000 0.943
 3 summer2000 1.03 
 4 fall2000   0.981
 5 winter2001 1.06 
 6 spring2001 1.05 
 7 summer2001 1.02 
 8 fall2001   1.03 
 9 winter2002 1.02 
10 spring2002 1.05 

Now I am looking to pull a lag of this data but only for the previous spring so that my data looks something like this:

SeasonYear  mean     lag
   <chr>      <dbl>  <dbl> 
 1 winter2000 0.957   NA
 2 spring2000 0.943   NA
 3 summer2000 1.03    0.943
 4 fall2000   0.981   0.943
 5 winter2001 1.06    0.943
 6 spring2001 1.05    0.943
 7 summer2001 1.02    1.05
 8 fall2001   1.03    1.05
 9 winter2002 1.02    1.05
10 spring2002 1.05    1.05

I am also looking to go back 2 springs as well so that my data looks something like this:

SeasonYear  mean     lag
   <chr>      <dbl>  <dbl> 
 1 winter2000 0.957   NA
 2 spring2000 0.943   NA
 3 summer2000 1.03    NA
 4 fall2000   0.981   NA
 5 winter2001 1.06    NA
 6 spring2001 1.05    NA
 7 summer2001 1.02    0.943
 8 fall2001   1.03    0.943
 9 winter2002 1.02    0.943
10 spring2002 1.05    0.943

I know I can use the lag() function to get previous data in the dataframe but I am looking for a way to specify a function that pulls a particular type of lag like the one I mentioned.


Solution

  • One option to achieve your desired result may look like so:

    1. Split your SeasonYear into season and year
    2. Add a column with the value for the spring in each year
    3. Get the n-th lag taking into account that for fall and summer its the (n-1)-th lag
    library(tidyr)
    library(dplyr)
    
    lag_spring <- function(x, y, n = 1) {
      data.frame(x = x, season_year = y) %>%
        tidyr::extract(season_year, into = c("season", "year"), regex = "^(.+?)(\\d{4})$") %>%
        group_by(year) %>%
        mutate(springmean = x[season == "spring"]) %>%
        ungroup() %>%
        group_by(season) %>%
        mutate(lag = ifelse(!season %in% c("summer", "fall"), lag(springmean, n = n), lag(springmean, n = n - 1))) %>%
        ungroup() %>%
        pull(lag)
    }
    
    dd %>%
      mutate(lag = lag_spring(mean, SeasonYear))
    #>    SeasonYear  mean   lag
    #> 1  winter2000 0.957    NA
    #> 2  spring2000 0.943    NA
    #> 3  summer2000 1.030 0.943
    #> 4    fall2000 0.981 0.943
    #> 5  winter2001 1.060 0.943
    #> 6  spring2001 1.050 0.943
    #> 7  summer2001 1.020 1.050
    #> 8    fall2001 1.030 1.050
    #> 9  winter2002 1.020 1.050
    #> 10 spring2002 1.050 1.050
    
    dd %>%
      mutate(lag = lag_spring(mean, SeasonYear, n = 2))
    #>    SeasonYear  mean   lag
    #> 1  winter2000 0.957    NA
    #> 2  spring2000 0.943    NA
    #> 3  summer2000 1.030    NA
    #> 4    fall2000 0.981    NA
    #> 5  winter2001 1.060    NA
    #> 6  spring2001 1.050    NA
    #> 7  summer2001 1.020 0.943
    #> 8    fall2001 1.030 0.943
    #> 9  winter2002 1.020 0.943
    #> 10 spring2002 1.050 0.943
    

    DATA

    dd <- structure(list(SeasonYear = c(
      "winter2000", "spring2000", "summer2000",
      "fall2000", "winter2001", "spring2001", "summer2001", "fall2001",
      "winter2002", "spring2002"
    ), mean = c(
      0.957, 0.943, 1.03, 0.981,
      1.06, 1.05, 1.02, 1.03, 1.02, 1.05
    )), class = "data.frame", row.names = c(
      "1",
      "2", "3", "4", "5", "6", "7", "8", "9", "10"
    ))