Search code examples
rdataframedplyrdata.tabletidyverse

How to calculate LAG difference in R - if periods changes or how to add empty NA rows to dataframe?


If i have the following dataframe:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)
)

And would like to find the lag of the values so that i produce the following dataframe:

tibble(
  period = c("2010END", "2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END",
             "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2011END","2012END"),
  website = c(
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube"
  ),
  values = c(1, 2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30), 
  output = c(NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA, 10)
)

The periods are different - in that one period to lag by is clearly 5 - those going Q1,Q2,Q3,Q4,END and then a secondary lag is by period 2 - those going year_priorEND vs year_aheadEND and maybe even further.

Alternatively:

Instead would it be easier to just impute rows for these missing dates - for example making a condition that says, if there is no website which has a period of 5 (meaning that 5 values exist Q1,Q2,Q3,Q4,END) then generate the remainder of the rows for that website and period but have the values as NA, so something like this can be generated instead:

tibble(
  period = c("2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END", 
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END",
             "2010Q1","2010Q2","2010Q3","2010Q4","2010END", "2011Q1","2011Q2","2011Q3","2011Q4","2011END"),
  website = c(
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "google",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "facebook",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube",
    "youtube"
  ),
  values = c(NA,NA,NA,NA,1, NA,NA,NA,NA,2, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, NA,NA,NA,NA,20, NA,NA,NA,NA,30), 
  output = c(NA,NA,NA,NA,NA,NA,NA,NA,NA, 1,NA,NA,NA,NA,NA, 5,5,5,5,5, NA,NA,NA,NA,NA,NA,NA,NA,NA, 10)
)

So without explicitly coding out which fields need to imputed - i assume some form of checking can occur per group? Because in this case we can just use output = lag(values, 5) as the periods are all consistent


Solution

  • If I understand correctly, the OP wants a calculate the year-on-year differences of the values of a period and the same period in the prior year for each website. There is a related question Calculating Lag for dataframe in R? where the OP explicitely asks to calculate 2011Q1 - 2010Q1 and so on including 2011END - 2010END.

    Using lag() will only work if the time sequences are complete and the number of positions to lag is always constant. This is not the case for the given dataset.

    Therefore, I suggest to use an update self-join:

    library(data.table)
    setDT(inp)[, c("year", "qtr") := tstrsplit(period, "(?<=^\\d{4})", perl = TRUE, 
                                               type.convert = TRUE)][
                                                 , prior_year := year - 1L]
    inp[inp, on = .(prior_year = year, qtr, website), output := x.values - i.values][]
    
         period  website values year qtr prior_year output
     1: 2010END   google      1 2010 END       2009     NA
     2: 2011END   google      2 2011 END       2010      1
     3:  2010Q1 facebook      1 2010  Q1       2009     NA
     4:  2010Q2 facebook      2 2010  Q2       2009     NA
     5:  2010Q3 facebook      3 2010  Q3       2009     NA
     6:  2010Q4 facebook      4 2010  Q4       2009     NA
     7: 2010END facebook      5 2010 END       2009     NA
     8:  2011Q1 facebook      6 2011  Q1       2010      5
     9:  2011Q2 facebook      7 2011  Q2       2010      5
    10:  2011Q3 facebook      8 2011  Q3       2010      5
    11:  2011Q4 facebook      9 2011  Q4       2010      5
    12: 2011END facebook     10 2011 END       2010      5
    13: 2011END  youtube     20 2011 END       2010     NA
    14: 2012END  youtube     30 2012 END       2011     10
    

    Explanation

    1. period is split in two parts, year and qtr, year is coerced to type integer. The regular expression which is used for splitting uses a positive lookbehind zero-length assertion to split the strings right after the first four digits.
    2. prior_year is computed adn appended to the dataset.
    3. In the update self-join, for the same website and qtr year is matched with prior_year. The difference in value is calcaluated and appended as new column output.