
How to make a for loop with two indexes?


Dear StackOverflow users,

I am struggling to implement a for loop. I have a dataframe with a column Time (YMD-HMS format) and another column with particulate matter data. Furthermore, I have a dataframe with start and stop moments;

library(lubridate)  # provides ymd_hms()

#TIMEPOINTS log

start <- c(ymd_hms("2020-03-06 19:43:00",
                   "2020-03-06 19:47:00",
                   "2020-03-06 19:53:00",
                   "2020-03-06 20:00:00",
                   "2020-03-06 20:13:00",
                   "2020-03-06 20:22:00",
                   "2020-03-06 20:32:00",
                   "2020-03-06 20:36:00",
                   "2020-03-06 20:42:00",
                   "2020-03-06 20:45:00",
                   "2020-03-06 20:49:00",
                   "2020-03-06 21:01:00",
                   "2020-03-06 21:04:00",
                   "2020-03-06 21:06:00",
                   "2020-03-06 21:09:00",
                   "2020-03-06 21:12:00"))

end <- c(ymd_hms("2020-03-06 19:46:00",
                 "2020-03-06 19:49:00",
                 "2020-03-06 19:55:00",
                 "2020-03-06 20:02:00",
                 "2020-03-06 20:15:00",
                 "2020-03-06 20:24:00",
                 "2020-03-06 20:34:00",
                 "2020-03-06 20:38:00",
                 "2020-03-06 20:44:00",
                 "2020-03-06 20:47:00",
                 "2020-03-06 20:51:00",
                 "2020-03-06 21:03:00",
                 "2020-03-06 21:06:00",
                 "2020-03-06 21:08:00",
                 "2020-03-06 21:11:00",
                 "2020-03-06 21:14:00"))

df <- data.frame(start, end)

I wish to create a new dataframe with all data points except those within these specific time intervals, like this (but then using a for loop, iterating over the various start and end points):

dat2 <- dat %>% .[.[["Time"]] >= df$start[1],] %>%
    .[.[["Time"]] <= df$end[1],]

I know this can be done using a for loop and I tried to figure it out for my case, but I'm a bit lost..

Any help is highly appreciated!


Solution

  • To start with, I’d clean up your current code slightly:

    dat2 <- dat %>% .[.$Time >= df$start[1] & .$Time <= df$end[1],]
    

    By using &, you reduce two subset operations to one (it has to be & rather than &&, because .$Time is a vector and && only handles single values). And using $… reduces clutter compared to [["…"]] in this case.

    Next, I suggest extracting this condition into a function (in fact that function already exists in the ‘dplyr’ package: between). This allows us to write the code as

    dat2 <- dat %>% filter(between(Time, df$start[1], df$end[1]))
    

    Now we want to vectorise this to check for overlap with any interval:

    dat2 <- dat %>% filter(between_any(Time, df$start, df$end))
    

    Now we need to write that between_any function. Let’s start by implementing it for a single query value:

    between_any1 = function (x, left, right) {
        any(x >= left & x <=  right)
    }
    

    Note the use of & here, instead of &&; this is because we vectorised over left and right, and & is the vectorised version of &&. That is, 4 >= (1 : 3) & 4 <= (3 : 5) results in c(FALSE, TRUE, TRUE).
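
    To make the difference concrete, here is a minimal sketch with toy numbers (the intervals [1, 3], [2, 4] and [3, 5] are made up for illustration). As far as I know, recent R versions (4.3.0 and later) raise an error when && is given operands longer than one, while older versions silently looked at the first element only, so & is the one to use here:

    ```r
    # Toy values only: is x inside any of the intervals [1,3], [2,4], [3,5]?
    x     <- 4
    left  <- 1:3
    right <- 3:5

    # & compares element by element, giving one result per interval.
    inside <- x >= left & x <= right
    print(inside)

    # any() collapses those per-interval results into a single TRUE/FALSE.
    print(any(inside))
    ```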

    Now we need to make this work when x is a vector. We could use the base R function Vectorize but I generally find it better to do it manually:

    between_any = function (x, left, right) {
        map_lgl(x, ~ any(.x >= left & .x <= right))
    }
    

    This uses ‘purrr’, but we could just as well have used vapply from base R (lapply would work too, but returns a list that still needs unlist).
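
    For reference, a base R variant without ‘purrr’ might look like the sketch below. The name between_any_base and the numbers in the call are made up for illustration; with real data, x would be the Time column.

    ```r
    # Base R sketch of the same idea; between_any_base is a hypothetical name.
    between_any_base <- function(x, left, right) {
        # Loop over positions and index with x[i], so each element keeps
        # whatever class x has (e.g. POSIXct date-times).
        vapply(seq_along(x),
               function(i) any(x[i] >= left & x[i] <= right),
               logical(1))
    }

    # Toy call: which of these values fall into [1, 3] or [4, 5]?
    result <- between_any_base(c(0, 2, 4.5, 10), left = c(1, 4), right = c(3, 5))
    print(result)
    ```

    vapply also checks that every call returns exactly one logical value, which is a bit stricter than sapply.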

    Oh, and it sounds like you wanted to filter out times falling into the ranges in your df, so you need to invert the condition for filter:

    dat2 <- dat %>% filter(! between_any(Time, df$start, df$end))
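
    Since the question explicitly asked about a for loop, here is a rough base R sketch of the same exclusion written as a loop. The contents of dat are invented toy data (one reading per minute, made-up pm values) and only the first two intervals are used, so treat it as an illustration rather than a drop-in solution:

    ```r
    # Hypothetical toy data: one reading per minute; pm values are made up.
    dat <- data.frame(
        Time = as.POSIXct("2020-03-06 19:40:00", tz = "UTC") + 60 * (0:9),
        pm   = c(5, 7, 6, 9, 12, 11, 8, 6, 5, 7)
    )
    df <- data.frame(
        start = as.POSIXct(c("2020-03-06 19:43:00", "2020-03-06 19:47:00"), tz = "UTC"),
        end   = as.POSIXct(c("2020-03-06 19:46:00", "2020-03-06 19:49:00"), tz = "UTC")
    )

    # Start by keeping every row, then drop the rows inside each interval in turn.
    keep <- rep(TRUE, nrow(dat))
    for (i in seq_len(nrow(df))) {
        inside <- dat$Time >= df$start[i] & dat$Time <= df$end[i]
        keep <- keep & !inside
    }

    dat2 <- dat[keep, ]
    print(dat2)
    ```

    The loop visits each interval once and updates one logical vector, which avoids growing a dataframe inside the loop.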