I have a panel data set, where I have tens of years of data for thousands of assets.
There are also rows with NA values in RET/MV/RANK_LAG, which I would like to ignore.
I have got my data to this data frame, where I have months (in class(date)), asset codes, monthly returns and market capitalization. In addition. I have decile ranks that change every month.
Here is an example dataset. Instead of deciles I have just two ranks for the sake of this example dataset:
DATE | CODE | RET | MV | RANK_LAG |
---|---|---|---|---|
2000-01-01 | 999999 | 0.02 | 10 | 1 |
2000-01-01 | 2222X2 | -0.01 | 20 | 2 |
2000-01-01 | 123456 | 0.05 | 30 | 2 |
2000-01-01 | 5555Y5 | 0.00 | 5 | 1 |
2000-02-01 | 999999 | 0.05 | 10 | 2 |
2000-02-01 | 2222X2 | 0.10 | 20 | 2 |
2000-02-01 | 123456 | -0.1 | 30 | 1 |
2000-02-01 | 5555Y5 | -0.1 | 5 | 1 |
What I would like to do is to calculate a weighted average return (using MV as weights) for each RANK_LAG, for each DATE.
So a desired output would be:
DATE | RANK | WEIGHTED_RET |
---|---|---|
2000-01-01 | 1 | 0.013 |
2000-01-01 | 2 | 0.026 |
2000-02-01 | 1 | -0.100 |
2000-02-01 | 2 | 0.083 |
Use base function weighted.mean
in a dplyr
pipe.
df1 <- read.table(text = "DATE CODE RET MV RANK_LAG
2000-01-01 999999 0.02 10 1
2000-01-01 2222X2 -0.01 20 2
2000-01-01 123456 0.05 30 2
2000-01-01 5555Y5 0.00 5 1
2000-02-01 999999 0.05 10 2
2000-02-01 2222X2 0.10 20 2
2000-02-01 123456 -0.1 30 1
2000-02-01 5555Y5 -0.1 5 1
", header = TRUE, colClasses = c("Date", "character", rep("numeric", 3)))
suppressPackageStartupMessages(
library(dplyr)
)
df1 %>%
group_by(DATE, RANK_LAG) %>%
summarise(WEIGHTED_RET = weighted.mean(RET, MV, na.rm = TRUE), .groups = "drop")
#> # A tibble: 4 × 3
#> DATE RANK_LAG WEIGHTED_RET
#> <date> <dbl> <dbl>
#> 1 2000-01-01 1 0.0133
#> 2 2000-01-01 2 0.026
#> 3 2000-02-01 1 -0.1
#> 4 2000-02-01 2 0.0833
Created on 2023-02-07 with reprex v2.0.2