I have a table with non-consecutive dates, a categorical variable and a variable (price). I want the average price per category per day. But where it gets tricky is I want that average price to be the average price of the previous date that the category had entries.
Date | category | Price |
---|---|---|
10-12-2024 | Red | 0.9 |
10-12-2024 | Red | 0.92 |
10-12-2024 | Blue | 1.23 |
10-12-2024 | Blue | 1.12 |
10-17-2024 | Blue | 0.93 |
10-17-2024 | Blue | 1.14 |
10-19-2024 | Red | 0.99 |
10-19-2024 | Blue | 1.31 |
So I would like a 4th column of average price previous trading day
for that category.
For example, row 5 and 6 would have an entry of 1.175 (average of [3:4, 3]) for this new column.
Row 7 would have an entry of 0.91 (average of [1:2, 3]).
Row 8 would have an entry of 1.035 (average of [5:6, 3]).
Date | category | Price | Average price previous trading day |
---|---|---|---|
10-12-2024 | Red | 0.9 | |
10-12-2024 | Red | 0.92 | |
10-12-2024 | Blue | 1.23 | |
10-12-2024 | Blue | 1.12 | |
10-17-2024 | Blue | 0.93 | 1.175 |
10-17-2024 | Blue | 1.14 | 1.175 |
10-19-2024 | Red | 0.99 | 0.91 |
10-19-2024 | Blue | 1.31 | 1.035 |
I am a beginner coder so my attempts were limited to the unique() function and which(). I had no progress.
Grouping by category, calculate the dense rank which gives 1 for all the first dates, 2 for the second and so on. Then take the average of the prices of the rows whose dense rank is one less than the current row. Finally remove the rank
temporary variable. The coalesce
is to convert NaN to NA. It can be omitted if NaN is ok. Note that this is strictly left to right so that if we need to change the name of DF
it only has to be done in one place and that is at the start of the pipeline. It is also reasonably compact consisting of a single mutate
.
library(dplyr)
library(purrr)
DF %>%
mutate(rank = dense_rank(as.Date(Date, "%m-%d-%Y")),
avg_price = coalesce(map_dbl(rank, ~ mean(Price[.x == rank + 1]))),
rank = NULL, .by = category)
giving
Date category Price avg_price
1 10-12-2024 Red 0.90 NA
2 10-12-2024 Red 0.92 NA
3 10-12-2024 Blue 1.23 NA
4 10-12-2024 Blue 1.12 NA
5 10-17-2024 Blue 0.93 1.175
6 10-17-2024 Blue 1.14 1.175
7 10-19-2024 Red 0.99 0.910
8 10-19-2024 Blue 1.31 1.035
The input data in reproducible form:
DF <- data.frame(
Date = rep(c("10-12-2024", "10-17-2024", "10-19-2024"), c(4L, 2L, 2L)),
category = c("Red", "Red", "Blue", "Blue", "Blue", "Blue", "Red", "Blue"),
Price = c(0.9, 0.92, 1.23, 1.12, 0.93, 1.14, 0.99, 1.31)
)