r dplyr tidyr zoo smoothing

How can I create a running median of diel cycle from multiyear data?


I think this problem may be of interest to others who deal with data smoothing of long-term environmental variables.

I have a dataset structured as below:

Columns:

Date    Hour_Min    Y(response variable)

These data are hourly, and I need to create a moving average of the diel cycle, computed separately for each Hour_Min. For example, with a 31-day window, the running average for Hour_Min 00:00 on a given day would be the average of the 00:00 data points from that day, the previous 15 days, and the following 15 days. The same calculation would then repeat for that day's 01:00, and so on through the data frame.

Unfortunately the data also have many NAs, which is problematic for moving window averages, although I think that can be solved using rollapply from the zoo package.
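A minimal sketch of how rollapply() can skip NAs, using a made-up vector x rather than the real data. Extra arguments such as na.rm = TRUE are passed on to the summary function, and fill = NA pads the ends where a full window does not fit:

```r
library(zoo)

# Made-up example vector with gaps (not the real data)
x <- c(1, NA, 3, 4, NA, 6, 7)

# Centered 3-point running median; na.rm = TRUE is passed on to median(),
# and fill = NA pads the two end positions that lack a full window
x_med <- rollapply(x, width = 3, FUN = median, na.rm = TRUE, fill = NA)
```

A window that contains only NAs still yields NA, because median() of an empty vector is NA.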

One approach I tried was to use tidyr's spread function to switch from long to wide format, to create a dataframe like this:

Date    Y_Hour_Min_0000    Y_Hour_Min_0100    Y_Hour_Min_0200    etc...

If I could change the format in this way, I could then create new columns of running averages of each Y_Hour_Min_.... column. I would then need to gather everything together back to long format (another task I'm not sure how to approach).

However, I wasn't able to get the spread function to work so that it kept Date as a grouping variable associated with each Y_Hour_Min_.... column.
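The wide-format idea above might be sketched as follows, on a small made-up data frame. The names toy, wide, and long are hypothetical, and a 3-day window is used only to keep the example short. With only date, hour_min, and y present, spread() keeps date as the row identifier:

```r
library(tidyr)
library(zoo)

# Made-up data: 5 days x 2 hours, with one missing value
toy <- data.frame(
  date     = rep(as.Date("2016-01-01") + 0:4, each = 2),
  hour_min = rep(c("00:00", "01:00"), times = 5),
  y        = c(1, 10, 2, NA, 3, 30, 4, 40, 5, 50)
)

# Long -> wide: one row per date, one column per hour_min
wide <- spread(toy, key = hour_min, value = y)

# Running median down each hour column (every column except date)
wide[-1] <- lapply(wide[-1], function(col) {
  rollapply(col, width = 3, FUN = median, na.rm = TRUE, fill = NA)
})

# Wide -> long again, keeping date as the identifier
long <- gather(wide, key = "hour_min", value = "median_y", -date)
```

spread() and gather() are superseded in current tidyr by pivot_wider() and pivot_longer(), which could be swapped in here.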

Another, possibly more elegant, solution would be to create a single new column in one step, using some combination of rollapply and a custom function.
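One way the single-step idea might look is a grouped mutate(), sketched here on made-up data. The data frame toy and the 3-day window are assumptions for illustration, and the rows are assumed to be sorted by date within each hour:

```r
library(dplyr)
library(zoo)

# Made-up data: 5 days x 2 hours, with one missing value
toy <- data.frame(
  date     = rep(as.Date("2016-01-01") + 0:4, each = 2),
  hour_min = rep(c("00:00", "01:00"), times = 5),
  y        = c(1, 10, 2, NA, 3, 30, 4, 40, 5, 50)
)

toy <- toy %>%
  arrange(date, hour_min) %>%   # make sure each hour's series is in date order
  group_by(hour_min) %>%        # one rolling window per hour of day
  mutate(median_y = rollapply(y, width = 3, FUN = median,
                              na.rm = TRUE, fill = NA)) %>%
  ungroup()
```

Because the grouping is by hour_min, each window only ever sees values from the same time of day, and the result stays in long format.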

Any thoughts on how to implement code for this task will be greatly appreciated. Below is some simple code to simulate my dataset:

Simulated data:

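### Load packages used below:

library(dplyr)
library(tidyr)

### Create vector of hours/dates (UTC avoids daylight-saving gaps and duplicates):

date <- seq(as.POSIXct("2016-01-01 00:00", tz = "UTC"),
            as.POSIXct("2016-12-30 23:00", tz = "UTC"), by = "hour")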

### Create vector of noisy sine function:

d <- 365
n <- 24*d # number of data points
t <- seq(from = 0, to = 2*d*pi, length.out=24*d)
a <- 6
b <- 1
c.norm <- rnorm(n)
amp <- 3
y <- a*sin(b*t)+c.norm*amp+15

### Randomly insert NAs into data:
ind <- which(y %in% sample(y, 1000))
y[ind]<-NA

### Create test dataframe:

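# format() gives every timestamp a time part, so separate() always finds two pieces
df <- data.frame(dt = format(date, "%Y-%m-%d %H:%M:%S"), y = y) %>%
  separate(dt, c("date", "hour_min"), sep = " ") %>%
  mutate(date = as.Date(date))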

Solution

  • I think this could work:

    EDIT: Simplified code by adding fill = NA parameter to rollapply() function as suggested in the comments.

    library(zoo)  # provides rollapply()

    # add a complete date + time stamp
    df$date_time <- paste(df$date, df$hour_min)
    
    # make new column to store median data
    df$median_y <- NA
    
    # set rolling median width
    width_roll <- 31
    
    # do a rolling median for each hour, one at a time
    # add NAs where no median can be calculated
    for (i in levels(factor(df$hour_min))) {
      df[df$hour_min == i, "median_y"] <- rollapply(df[df$hour_min == i, "y"],
                                                    width = width_roll,
                                                    FUN = median,
                                                    na.rm = TRUE,
                                                    fill = NA)
    }
    

    The approach is just to use the rollapply() function as you suggested, but only on one particular hour at a time. Then each of these is placed back into a new column in turn.

    Here's an example for just one hour over the whole year, which makes it easier to visualize the median smoothing.

    # Examples:
    
    # plot one hour plus rolling median over time
    # set the hour explicitly rather than relying on the loop's last value
    i <- "23:00:00"
    plot(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
         y = df[df$hour_min == i, "y"],
         type = "l",
         col = "blue",
         ylab = "y values",
         xlab = i)
    lines(x = as.POSIXct(df[df$hour_min == i, "date_time"]),
          y = df[df$hour_min == i, "median_y"],
          lwd = 3)
    legend("topleft", 
           legend = c("raw", "median"), 
           col = c("blue", "black"), 
           lwd = 3)
    

    [Figure: plot for a single hour]

    This is the same plot for all the data (there is a lot of data, so it is not easy to see, but it looks like it worked).

    # plot all the data
    plot(x = as.POSIXct(df$date_time),
         y = df$y,
         type = "l",
         col = "blue",
         ylab = "y values",
         xlab = "Date")
    lines(x = as.POSIXct(df$date_time),
          y = df$median_y,
          lwd = 3)
    legend("topleft", 
           legend = c("raw", "median"), 
           col = c("blue", "black"), 
           lwd = 3)
    

    [Figure: plot for all data]
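
As a possible variation on the loop-based answer above: with fill = NA, the first and last 15 days of each hour get no median. If shorter windows at the edges are acceptable, rollapply()'s partial argument can be used instead. This is a sketch on a made-up vector, not part of the original answer:

```r
library(zoo)

# Made-up example vector
x <- c(1, 2, 3, 4, 5)

# Full windows only: the ends are padded with NA
strict <- rollapply(x, width = 3, FUN = median, na.rm = TRUE, fill = NA)

# partial = TRUE: at the ends, use whatever part of the window is in range
edges <- rollapply(x, width = 3, FUN = median, na.rm = TRUE, partial = TRUE)
```

Note that rollapply() counts observations, not calendar days, so a width of 31 spans 31 days only when every date has a row for that hour, as it does in the simulated data.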