Search code examples
rdplyrdata.tabletidyverse

Calculating Lag for dataframe in R?


If I have the following dataframe/tibble in R:

tibble(
  period =
    c(
      "2010Q1",
      "2010Q2",
      "2010Q3",
      "2010Q4",
      "2010END",
      "2011Q1",
      "2011Q2",
      "2011Q3",
      "2011Q4",
      "2011END",
      "2012Q1",
      "2012Q2",
      "2012Q3",
      "2012Q4",
      "20120END",
      "2010Q1",
      "2010Q2",
      "2010Q3",
      "2010Q4",
      "2010END",
      "2011Q1",
      "2011Q2",
      "2011Q3",
      "2011Q4",
      "2011END",
      "2012Q1",
      "2012Q2",
      "2012Q3",
      "2012Q4",
      "20120END"),
  website = c(
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook"
  ), 
  values = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1)
)

How can i perform the lag calculation for each period value from the year prior e.g I would like to create the calculation for 2011Q1 - 2010Q1 and so on including 2011END - 2010END

So that i get a table which looks like the following:

tibble(
  period =
    c(
      "2010Q1",
      "2010Q2",
      "2010Q3",
      "2010Q4",
      "2010END",
      "2011Q1",
      "2011Q2",
      "2011Q3",
      "2011Q4",
      "2011END",
      "2012Q1",
      "2012Q2",
      "2012Q3",
      "2012Q4",
      "20120END",
      "2010Q1",
      "2010Q2",
      "2010Q3",
      "2010Q4",
      "2010END",
      "2011Q1",
      "2011Q2",
      "2011Q3",
      "2011Q4",
      "2011END",
      "2012Q1",
      "2012Q2",
      "2012Q3",
      "2012Q4",
      "20120END"),
  website = c(
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook"
  ), 
  values = c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1), 
  calculation = c(NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,14,12,10,8,6,4,2,0,-2,-4,-6,-8,-10,-12,-14))

Here it makes sense we cannot compare to the period prior so it is NA - for the year 2011 all values are calculated like this:

  • 2010q1 - NA (nothing before)
  • 2010q2 - NA (nothing before)
  • 2010q3 - NA (nothing before)
  • 2010q4 - NA (nothing before)
  • 2010END - NA (nothing before)
  • 2011q1 - 2010q1
  • 2011q2 - 2010q2
  • 2011q3 - 2010q3
  • 2011q4 - 2010q4
  • 2011END - 2010END

Having some issues with using the lag() function when trying to group the data by the period column.


Solution

  • If every year is divided into 5 periods, lag(...,5) should take the value 5 rows earlier that the one you are calculating.

    example %>% 
      mutate(calculation = values - lag(values,5))
    

    Output:

    # A tibble: 30 x 3
       period  values calculation
       <chr>    <dbl> <dbl>
     1 2010Q1       1    NA
     2 2010Q2       2    NA
     3 2010Q3       3    NA
     4 2010Q4       4    NA
     5 2010END      5    NA
     6 2011Q1       6     5
     7 2011Q2       7     5
     8 2011Q3       8     5
     9 2011Q4       9     5
    10 2011END     10     5
    # ... with 20 more rows
    

    EDIT: As @AndrewGB accurately said, group_by(website) must be added to separate operations for each website. Also, I'm assuming that rows are already arranged by period.

    example %>%
    group_by(website) %>%
      mutate(calculation = values - lag(values,5))