Search code examples
rdatetimelag

Compute returns using lagged values on date when there are more than one row for the same date in R


I have a dataset containing data for 6k asset and their marketprices.

I want to compute the daily returns, hence to apply the formula martketprice[i] - marketprice[i-1]/marketprice[i-1] The problem is that I have multiple observation for the same datetime, for example for asset x, I have 3 observation for the day t because it was traded by investor 1, 2 and 3. And so on so forth for every asset in the dataset. So my dataset can look like:

investor    asset    datetime      marketprice
1            x          t              10
2            x          t              10
3            x          t              10

My idea was to use something like

res <- res %>% 
  arrange(datetime) %>% 
  group_by(asset) %>% 
  mutate(ret = (marketprice - dplyr::lag(marketprice))/dplyr::lag(marketprice, default = NA)) %>% 
  ungroup()

but it doesn't work since, in the example above, for row 2 would mean use marketprice [i-1] which is the same day marketprice, while I want the previous day [t-1] to be used (not included in the example dataset)

Furthermore R should check that the [i-1] marketprice is not belonging to a date which is more than 4 days distant, hence if the date of row i is 10th of july, then the computation should apply only if the date [i-1] is 6th of july or closer.

Any idea?


Solution

  • Based on the following assumptions I understand:

    1. When repeated day for the same asset, the marketprice is the same, not matter the investor.
    2. You don't mind which investor was (so we can remove rows)
    3. When day (t) is 5 days or ahead from previous (t-1), a NaN output is ok.

    Libraries and some data example:

    library(lubridate)
    library(tidyverse)
    
    # Data example
    
    set.seed(132) # reproducibility
    
    example = data.frame(
      investor = c(rep(1,3),2,3,rep(2,2),1,
                   rep(2,4),rep(3,4)),
      asset = c(rep('A',8),
                rep('B',8)),
      datetime = c(today()+c(1,2,3,3,3,4,5,6),
                   today()+c(1,seq(6,9),seq(16,18))),
      marketprice = c(10,20,30,30,30,sample(c(10,20,30),11,replace = TRUE))
    )
    

    Example dataset has 2 assets. First one (A) shows how the code deals with several rows for the same day. Second (B) shows how the code deals when there is a jump in dates greater than 4 days.

    > example
       investor asset   datetime marketprice
    1         1     A 2022-05-26          10
    2         1     A 2022-05-27          20
    3         1     A 2022-05-28          30
    4         2     A 2022-05-28          30
    5         3     A 2022-05-28          30
    6         2     A 2022-05-29          30
    7         2     A 2022-05-30          30
    8         1     A 2022-05-31          30
    9         2     B 2022-05-26          20
    10        2     B 2022-05-31          10
    11        2     B 2022-06-01          20
    12        2     B 2022-06-02          10
    13        3     B 2022-06-03          10
    14        3     B 2022-06-10          30
    15        3     B 2022-06-11          20
    16        3     B 2022-06-12          10
    

    Dplyr code:

    # The formula is [price(t)-price(t-1)]/price(t-1) -> dif(price)/lag(price)
    ret = example %>% 
      group_by(asset,datetime) %>% 
      slice(1) %>%  # remove repeated dates
      group_by(asset) %>% 
      arrange(datetime) %>% 
      mutate(ret = ifelse(datetime-lag(datetime) > 4,
                    NA,
                    (marketprice-lag(marketprice))/lag(marketprice))
             ) %>% # ifelse check the differences of days
      arrange(asset,datetime) # show by assets and dates
    

    Output:

    # A tibble: 14 x 5
    # Groups:   asset [2]
       investor asset datetime   marketprice    ret
          <dbl> <chr> <date>           <dbl>  <dbl>
     1        1 A     2022-05-26          10 NA    
     2        1 A     2022-05-27          20  1    
     3        1 A     2022-05-28          30  0.5  
     4        2 A     2022-05-29          30  0    
     5        2 A     2022-05-30          30  0    
     6        1 A     2022-05-31          30  0    
     7        2 B     2022-05-26          20 NA    
     8        2 B     2022-05-31          10 NA    
     9        2 B     2022-06-01          20  1    
    10        2 B     2022-06-02          10 -0.5  
    11        3 B     2022-06-03          10  0    
    12        3 B     2022-06-10          30 NA    
    13        3 B     2022-06-11          20 -0.333
    14        3 B     2022-06-12          10 -0.5 
    

    2 rows dropped because a day had 3 entries of data.