How can I create a running median of diel cycle from multiyear data?

I think this problem may be of interest to others who deal with data smoothing of long-term environmental variables.

I have a dataset structured as below:

Columns:

Date    Hour_Min    Y(response variable)

These data are hourly, and I need to create a moving average of the diel cycle, computed separately for each Hour_Min. In other words, with a 31-day window, the running-average data point for Hour_Min 00:00 on a given day would be the average of that day's 00:00 value together with the 00:00 values from the previous 15 days and the following 15 days. The same calculation would then be repeated for that day's 01:00, and so on through the dataframe.

Unfortunately the data also have many NAs, which is problematic for moving window averages, although I think that can be solved using rollapply from the zoo package.

One approach I tried was to use tidyr's spread function to switch from long to wide format, to create a dataframe like this:

Date    Y_Hour_Min_0000    Y_Hour_Min_0100    Y_Hour_Min_0200    etc...

If I could change the format in this way, I could then create new columns of running averages of each Y_Hour_Min_.... column. I would then need to gather everything together back to long format (another task I'm not sure how to approach).

However, I wasn't able to get the spread function to work so that it kept Date as a grouping variable associated with each Y_Hour_Min_.... column.

Another, possibly more elegant, solution would be to create a single new column in one step, using some combination of rollapply and a custom function.

Any thoughts on how to implement code for this task will be greatly appreciated. Below is some simple code to simulate my dataset:

Simulated data:

### Create vector of hours/dates:

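# tz = "UTC" avoids daylight-saving gaps and duplicates in the hourly sequence
date <- seq(as.POSIXct("2016-01-01 00:00", tz = "UTC"),
            as.POSIXct("2016-12-30 23:00", tz = "UTC"), by = "hour")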

### Create vector of noisy sine function:

d <- 365
n <- 24*d # number of data points
t <- seq(from = 0, to = 2*d*pi, length.out=24*d)
a <- 6
b <- 1
c.norm <- rnorm(n)
amp <- 3
y <- a*sin(b*t)+c.norm*amp+15

### Randomly insert NAs into data:
ind <- which(y %in% sample(y, 1000))
y[ind]<-NA

### Create test dataframe:

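library(dplyr)  # %>% and mutate()
library(tidyr)  # separate()

# format() makes sure every timestamp, including midnight, carries a time part
df <- data.frame(dt = format(date, "%Y-%m-%d %H:%M:%S"), y = y) %>%
  separate(dt, c("date", "hour_min"), sep = " ") %>%
  mutate(date = as.Date(date))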

Solution

I think this could work:

EDIT: Simplified code by adding fill = NA parameter to rollapply() function as suggested in the comments.

# add a complete date + time stamp
df$date_time <- paste(df$date, df$hour_min)

# make new column to store median data
df$median_y <- NA

# set rolling median width
width_roll <- 31

library(zoo)  # rollapply()

# do a rolling median for each hour, one at a time
# fill = NA pads the positions where a full window does not fit
for (i in levels(factor(df$hour_min))) {
  df[df$hour_min == i, "median_y"] <- rollapply(df[df$hour_min == i, "y"],
                                                width = width_roll,
                                                median,
                                                na.rm = TRUE,
                                                fill = NA)
}

The approach is just to use the rollapply() function as you suggested, but only on one particular hour at a time. Then each of these is placed back into a new column in turn.
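The same idea can probably be written without an explicit loop by grouping on the hour. The following is only a minimal sketch on a small made-up data frame (toy, stamps and toy_med are names invented for this illustration; the column names mirror the question), assuming dplyr and zoo are installed:

```r
library(dplyr)
library(zoo)

# Hypothetical toy data: 40 days of hourly values with some NAs
set.seed(1)
stamps <- seq(as.POSIXct("2016-01-01 00:00", tz = "UTC"),
              by = "hour", length.out = 24 * 40)
toy <- data.frame(
  date     = as.Date(stamps, tz = "UTC"),
  hour_min = format(stamps, "%H:%M:%S"),
  y        = rnorm(length(stamps), mean = 15, sd = 3)
)
toy$y[sample(nrow(toy), 100)] <- NA

width_roll <- 31

toy_med <- toy %>%
  arrange(hour_min, date) %>%   # rows must be in date order within each hour
  group_by(hour_min) %>%        # one rolling window per time of day
  mutate(median_y = rollapply(y, width = width_roll, FUN = median,
                              na.rm = TRUE, fill = NA)) %>%
  ungroup() %>%
  arrange(date, hour_min)       # back to chronological order
```

With the default centred alignment, the first and last 15 days of each hour come back as NA because a full 31-day window does not fit there.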

Here's an example for just one hour over the whole year, which makes it easier to visualize the median smoothing.

# Examples:

# plot one hour plus rolling median over time
# after the loop above, i still holds the last hour processed, "23:00:00"
plot(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
     y = df[df$hour_min == i, "y"],
     type = "l",
     col = "blue",
     ylab = "y values",
     xlab = i)
lines(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
      y = df[df$hour_min == i, "median_y"],
      lwd = 3)
legend("topleft", 
       legend = c("raw", "median"), 
       col = c("blue", "black"), 
       lwd = 3)

Plot for a single hour

This is the same plot for the whole dataset. There are too many points to see the detail easily, but it looks like it worked.

# plot all the data
plot(x = as.POSIXct(df$date_time),
     y = df$y,
     type = "l",
     col = "blue",
     ylab = "y values",
     xlab = "Date")
lines(x = as.POSIXct(df$date_time),
      y = df$median_y,
      lwd = 3)
legend("topleft", 
       legend = c("raw", "median"), 
       col = c("blue", "black"), 
       lwd = 3)

Plot for all data
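For completeness, the wide-format route described in the question can also be made to work. This is a rough sketch rather than a tested solution for the real dataset: it uses a small made-up data frame (toy, stamps, wide, hour_cols and long_med are names invented here) and tidyr's pivot_wider() and pivot_longer(), which supersede spread() and gather().

```r
library(tidyr)
library(zoo)

# Hypothetical toy data: 40 days of hourly values with some NAs
set.seed(2)
stamps <- seq(as.POSIXct("2016-01-01 00:00", tz = "UTC"),
              by = "hour", length.out = 24 * 40)
toy <- data.frame(
  date     = as.Date(stamps, tz = "UTC"),
  hour_min = format(stamps, "%H:%M:%S"),
  y        = rnorm(length(stamps), mean = 15, sd = 3)
)
toy$y[sample(nrow(toy), 100)] <- NA

# long -> wide: one row per date, one column per time of day
wide <- pivot_wider(toy, names_from = hour_min, values_from = y)
wide <- wide[order(wide$date), ]

# rolling median down each hour column (every column except date)
hour_cols <- setdiff(names(wide), "date")
wide[hour_cols] <- lapply(wide[hour_cols], function(col) {
  rollapply(col, width = 31, FUN = median, na.rm = TRUE, fill = NA)
})

# wide -> long again: one row per date and time of day
long_med <- pivot_longer(wide, cols = -date,
                         names_to = "hour_min", values_to = "median_y")
```

The result can then be joined back to the original long data on date and hour_min. Only date, hour_min and y should be present before pivoting, since any extra column would be treated as an additional identifier.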