Search code examples
rdataframeaverage

How do I calculate the averages of values per date and category and display them lagged?


I have a table with non-consecutive dates, a categorical variable and a variable (price). I want the average price per category per day. But where it gets tricky is I want that average price to be the average price of the previous date that the category had entries.

Date category Price
10-12-2024 Red 0.9
10-12-2024 Red 0.92
10-12-2024 Blue 1.23
10-12-2024 Blue 1.12
10-17-2024 Blue 0.93
10-17-2024 Blue 1.14
10-19-2024 Red 0.99
10-19-2024 Blue 1.31

So I would like a 4th column of average price previous trading day for that category. For example, row 5 and 6 would have an entry of 1.175 (average of [3:4, 3]) for this new column. Row 7 would have an entry of 0.91 (average of [1:2, 3]). Row 8 would have an entry of 1.035 (average of [5:6, 3]).

Date category Price Average price previous trading day
10-12-2024 Red 0.9
10-12-2024 Red 0.92
10-12-2024 Blue 1.23
10-12-2024 Blue 1.12
10-17-2024 Blue 0.93 1.175
10-17-2024 Blue 1.14 1.175
10-19-2024 Red 0.99 0.91
10-19-2024 Blue 1.31 1.035

I am a beginner coder so my attempts were limited to the unique() function and which(). I had no progress.


Solution

  • Grouping by category, calculate the dense rank which gives 1 for all the first dates, 2 for the second and so on. Then take the average of the prices of the rows whose dense rank is one less than the current row. Finally remove the rank temporary variable. The coalesce is to convert NaN to NA. It can be omitted if NaN is ok. Note that this is strictly left to right so that if we need to change the name of DF it only has to be done in one place and that is at the start of the pipeline. It is also reasonably compact consisting of a single mutate.

    library(dplyr)
    library(purrr)
    
    DF %>%
      mutate(rank = dense_rank(as.Date(Date, "%m-%d-%Y")), 
         avg_price = coalesce(map_dbl(rank, ~ mean(Price[.x == rank + 1]))),
         rank = NULL, .by = category)
    

    giving

            Date category Price avg_price
    1 10-12-2024      Red  0.90        NA
    2 10-12-2024      Red  0.92        NA
    3 10-12-2024     Blue  1.23        NA
    4 10-12-2024     Blue  1.12        NA
    5 10-17-2024     Blue  0.93     1.175
    6 10-17-2024     Blue  1.14     1.175
    7 10-19-2024      Red  0.99     0.910
    8 10-19-2024     Blue  1.31     1.035
    

    Note

    The input data in reproducible form:

    DF <- data.frame(
      Date = rep(c("10-12-2024", "10-17-2024", "10-19-2024"), c(4L, 2L, 2L)),
      category = c("Red", "Red", "Blue", "Blue", "Blue", "Blue", "Red", "Blue"),
      Price = c(0.9, 0.92, 1.23, 1.12, 0.93, 1.14, 0.99, 1.31)
    )