I have a data frame z with 8 million daily observations. For each company (identified by seriesid), I want the last value in each month if it is available, and otherwise the most recent earlier value within that month, for both totalreturn and close.
I have tried z[apply.monthly(z$date, max, by = z$seriesid)], but this returns only NAs. Other attempts with apply returned only a single date value (not one per seriesid).
date company totalreturn close seriesid
1: 2018-01-30 x 910.2214 133.375 55860
2: 2018-02-06 x 905.9561 132.750 55860
3: 2018-02-13 x 900.8377 132.000 55860
4: 2018-02-20 x 900.8377 132.000 55860
5: 2018-02-27 x 911.0745 133.500 55860
6: 2017-03-06 y 921.3112 135.000 55940
7: 2017-03-13 y 917.8990 134.500 55940
Ideally, the dataset would look like this:
date company totalreturn close seriesid
1: 2018-01-30 x 910.2214 133.375 55860
5: 2018-02-27 x 911.0745 133.500 55860
7: 2017-03-13 y 917.8990 134.500 55940
That is, one non-NA observation per company per month.
We could group_by seriesid and year-month, then select the last row that is non-NA for both totalreturn and close.
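library(dplyr)

# Assumes `df` holds the data (called z in the question)
df %>%
  # one group per company and calendar month
  group_by(seriesid, month = format(date, "%Y%m")) %>%
  # cumsum() peaks at the last complete row; which.max() returns its position
  # (a group with no complete row yields its first row)
  slice(which.max(cumsum(!is.na(totalreturn) & !is.na(close)))) %>%
  ungroup() %>%
  select(-month)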
# date company totalreturn close seriesid
# <date> <fct> <dbl> <dbl> <int>
#1 2018-01-30 x 910. 133. 55860
#2 2018-02-27 x 911. 134. 55860
#3 2017-03-13 y 918. 134. 55940
This assumes your date column is of class Date; if it is not, convert it to Date first.
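If date is stored as text, the conversion could look like the following minimal sketch. The two-row data frame is a hypothetical stand-in for the real data, and the format string assumes year-month-day strings like those shown in the question.

```r
# Hypothetical stand-in for the question's data, with date stored as text
df <- data.frame(
  date = c("2018-01-30", "2018-02-06"),
  seriesid = c(55860L, 55860L),
  stringsAsFactors = FALSE
)

# Convert to Date class; the format assumes "YYYY-MM-DD" text
df$date <- as.Date(df$date, format = "%Y-%m-%d")

class(df$date)           # "Date"
format(df$date, "%Y%m")  # "201801" "201802"
```

Once the column is a Date, format(date, "%Y%m") yields the year-month key used for grouping.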
Or, using base R ave, we can do:
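# keep only rows with no NA in any column
df1 <- df[complete.cases(df), ]
# ave() gives each row the index of the last row in its seriesid/month group
df1[unique(with(df1, ave(seq_along(date), seriesid, format(date, "%Y%m"),
                         FUN = function(x) tail(x, 1)))), ]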
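To see the fallback to an earlier value within the month, the base R approach can be tried on the sample rows from the question. This is only a sketch: the NA placed in close on 2018-02-27 is a hypothetical change to the posted data, made so that the last February row is incomplete.

```r
# Sample rows from the question, rebuilt as a plain data frame
df <- data.frame(
  date = as.Date(c("2018-01-30", "2018-02-06", "2018-02-13", "2018-02-20",
                   "2018-02-27", "2017-03-06", "2017-03-13")),
  company = c("x", "x", "x", "x", "x", "y", "y"),
  totalreturn = c(910.2214, 905.9561, 900.8377, 900.8377,
                  911.0745, 921.3112, 917.8990),
  close = c(133.375, 132.750, 132.000, 132.000, 133.500, 135.000, 134.500),
  seriesid = c(55860L, 55860L, 55860L, 55860L, 55860L, 55940L, 55940L)
)

# Hypothetical modification: make the last February row incomplete
df$close[5] <- NA

# Drop incomplete rows, then keep the last remaining row per seriesid/month
df1 <- df[complete.cases(df), ]
last_idx <- with(df1, ave(seq_along(date), seriesid, format(date, "%Y%m"),
                          FUN = function(x) tail(x, 1)))
res <- df1[unique(last_idx), ]

format(res$date)  # "2018-01-30" "2018-02-20" "2017-03-13"
```

With the 2018-02-27 row removed by complete.cases, February for seriesid 55860 falls back to the 2018-02-20 observation.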