Search code examples
rrollapply

adjust "width" argument in rollapply() function in r for discontinuous dates


I have a dataset of daily remotely sensed data. In short, it's reflectance (values between 0 and 1) for the last 20 years. Because it's remotely sensed data, some dates do not have a value because of clouds or some other obstruction.

I want to use rollapply() in R's zoo package to detect in the time series when the values remain at 1.0 for a certain amount of time (let's say 2 weeks) or at 0 for that same amount of time.

I have code to do this, but the width argument in the rollapply() function (the 2-week threshold mentioned in the previous paragraph) looks at data points rather than time. So it looks at 14 data values rather than 14 days, which may span over a month due to the missing data values from cloud cover etc.

Here's an example:

test_data <- data.frame(date = c("2000-01-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"), 
                        value = c(0, 1, 1, 1, 0))

test_data$date <- ymd(test_data$date)

select_first_1_value <- test_data %>%
  mutate(value = rollapply(value, width = 3, min, align = "left", fill = NA, na.rm = TRUE)) %>%
  filter(value == 1) %>%
  filter(row_number() == 1) %>%
  ungroup

With the argument as width = 3, it works. It finds that 2000-01-02 is the first date where a value = 1 occurs for at least 3 values. However, if I change this to 14, it no longer works, because it only sees 5 values in this instance. Even if I wrote out an additional 10 values that equal 1 (for a total of 15), it would be incorrect because the value = 0 at 2000-01-18 and it is only counting data points and not dates.

But when we look at the dates, there are missing dates between 2000-01-03 and 2000-01-17. If both are a value = 1, then I want to extract 2000-01-02 as the first instance where the time series remains at 1 for at least 14 consecutive days. Here, I'm assuming that the values are 1 for the missing days.

Any help is greatly appreciated. Thank you.


Solution

  • There really are two problems here:

    1. How to roll by date rather than number of points.
    2. how to find the first stretch of 14 days of 1's assuming that missing dates are 1.

    Note that (2) is not readily solved by (1) because the start of the first series of ones may not be any of the listed dates! For example, suppose we change the first date to Dec 1, 1999 giving test_data2 below. Then the start of the first period of 14 ones is Dec 2, 1999. It is not any of the dates in the test_data2 series.

    test_data2 <- data.frame(
      date = c("1999-12-01", "2000-01-02", "2000-01-03", "2000-01-17", "2000-01-18"), 
      value = c(0, 1, 1, 1, 0))
    

    1) What we need to do is not roll by date but rather expand the series to fill in the missing dates giving zz and then use rollapply. Below do that by creating a zoo series (which also converts the dates to Date class) and then convert that to ts class. Because ts class can only represent regularly spaced series that conversion will fill in the missing dates and provide a value of NA for them. We can fill those in with 1 and then convert back to zoo with Date class index.

    library(zoo)
    
    z <- read.zoo(test_data2)
    zz <- z |> as.ts() |> na.fill(1) |> as.zoo() |> aggregate(as.Date)
    r <- rollapply(zz, 14, min, na.rm = TRUE, partial = TRUE, align = "left")
    time(r)[which(r == 1)[1]]
    ## [1] "1999-12-02"
    

    2) Another way to solve this not involving rollapply at all would be to use rle. Using zz from above

    ok <- with(rle(coredata(zz)), rep(lengths >= 14 & values == 1, lengths))
    tt[which(ok)[1]]
    ## [1] "1999-12-02"
    

    3) Another way without using rollapply is to extract the 0 value rows and then keep only those whose difference exceeds 14 days from the next 0 value row. Finally take the first such row and use the date one day after it. This assumes that there is at least one 0 row before the first run of 14+ ones. Below we have returned back to using test_data from the question although this would have also worked with test_data2.

    library(dplyr)
    test_data %>%
      mutate(date = as.Date(date)) %>%
      filter(value == 0) %>%
      mutate(diff = as.numeric(lead(date) - date)) %>%
      filter(diff > 14) %>%
      head(1) %>%
      mutate(date = date + 1)
    ##         date value diff
    ## 1 2000-01-02     0   17
    

    rollapply over dates rather than points

    4) The question also discussed using rollapply over dates rather than points which we address here. As noted above this does not actually solve the question of finding the first stretch of 14+ ones so instead we show how to find the first date in the series which starts a stretch of at least 14 ones. In general, we do this by first calculating a width vector using findInterval and then use rollapply in the usual way but with those widths rather than using a scalar width. This only involves one extra line of code to calculate the widths, w.

    # using test_data from question
    tt <- as.Date(test_data$date)
    
    w <- findInterval(tt + 13, tt, rightmost.closed = TRUE) - seq_along(tt) + 1
    r <- rollapply(test_data$value, w, min, fill = NA, na.rm = TRUE, align = "left")
    tt[which(r == 1)[1]]
    ## [1] "2000-01-02"
    

    There are further examples in ?rollapply showing how to roll by time rather than number of points.

    sqldf

    5) A completely different way of approaching the problem of finding the first 14+ ones with a date in the series is to use an SQL self join. It joins the first instance of test aliased to a to a second instance b associating all rows of b within the indicated date range and of a taking the minimum value of those creating a new column min14 with those minimums. The having clause then keeps only those rows for which min14 is 1 and of those the limit clause keeps the first. We then extract the date at the end.

    library(sqldf)
    
    test <- transform(test_data, date = as.Date(date))
    
    sqldf("select a.*, min(b.value) min14
      from test a
      left join test b on b.date between a.date and a.date + 13
      group by a.rowid
      having min14 = 1
      limit 1")$date
    ## [1] "2000-01-02"