r, dplyr, tidyverse, data-munging

Filtering data relative to first and last occurrence of an event


I have a data frame from an experiment in which stimuli are shown to participants and time is measured continuously.

# reprex
library(tidyverse)
df <- 
    tibble(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
           time = 0:12)
# A tibble: 13 x 2
   stim   time
   <chr> <int>
 1 NA        0
 2 NA        1
 3 NA        2
 4 NA        3
 5 a         4
 6 b         5
 7 NA        6
 8 c         7
 9 NA        8
10 d         9
11 NA       10
12 NA       11
13 NA       12

I want a generalized solution, using tidyverse functions, that keeps only the data from 1 second before the first marker to 2 seconds after the last marker and drops everything else. I thought the following would work, but it throws an uninformative error.

df %>% 
# store times for first and last stim
    mutate(first_stim = drop_na(stim) %>% pull(time) %>% first(),
           last_stim =  drop_na(stim) %>% pull(time) %>% last()) %>% 
# filter df based on new variables
    filter(time >= first(first_stim) - 1 &
           time <= first(last_stim) + 2)
Error in mutate_impl(.data, dots) : bad value

So I wrote some pretty ugly base R code to work around this issue by changing the mutate call:

df2 <- df %>% 
    mutate(first_stim = .[!is.na(.$stim), "time"][1,1],
           last_stim = .[!is.na(.$stim), "time"][nrow(.[!is.na(.$stim), "time"]), 1])
    # A tibble: 13 x 4
       stim   time first_stim last_stim
       <chr> <int> <tibble>   <tibble> 
     1 NA        0 4          9        
     2 NA        1 4          9        
     3 NA        2 4          9        
     4 NA        3 4          9        
     5 a         4 4          9        
     6 b         5 4          9        
     7 NA        6 4          9        
     8 c         7 4          9        
     9 NA        8 4          9        
    10 d         9 4          9        
    11 NA       10 4          9        
    12 NA       11 4          9        
    13 NA       12 4          9   

Now I only need to filter on the new variables, using first_stim - 1 and last_stim + 2. But filter fails too:

df2 %>% 
    filter(time >= first(first_stim) - 1 &
           time <= first(last_stim) + 2)
Error in filter_impl(.data, quo) : 
  Not compatible with STRSXP: [type=NULL].

I was able to do it in base R, but it is really ugly:

df2[(df2$time >= (df2[[1, "first_stim"]] - 1)) & 
    (df2$time <= (df2[[1, "last_stim"]] + 2))    
    ,]

The desired output should look like this:

# A tibble: 9 x 2
   stim   time
   <chr> <int>
 4 NA        3
 5 a         4
 6 b         5
 7 NA        6
 8 c         7
 9 NA        8
10 d         9
11 NA       10
12 NA       11

I believe the errors are related to dplyr::nth() and related functions. I've found some old issues describing this behavior (https://github.com/tidyverse/dplyr/issues/1980), but they should no longer apply. I would really appreciate it if someone could point out what the problem is and how to do this in a tidy way.


Solution

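  • You could use a combination of is.na() and which(). Note that this filters on row position rather than on the time column, so it assumes one row per second: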

    library(dplyr)
    
    df <- 
      tibble(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
             time = 0:12)
    
    df %>% 
      filter(row_number() >= first(which(!is.na(stim))) - 1 & 
             row_number() <= last(which(!is.na(stim))) + 2)
    
    # # A tibble: 9 x 2
    #   stim   time
    #   <chr> <int>
    # 1 NA        3
    # 2 a         4
    # 3 b         5
    # 4 NA        6
    # 5 c         7
    # 6 NA        8
    # 7 d         9
    # 8 NA       10
    # 9 NA       11
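
The filter above works on row positions, so it relies on the rows being exactly one second apart. If that cannot be assumed, a minimal sketch of the same idea that compares the time column directly might look like this (trimmed is just an illustrative name, and min()/max() are used instead of first()/last() so the result does not depend on row order):

```r
library(dplyr)

df <- 
  tibble(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
         time = 0:12)

# time[!is.na(stim)] holds the times at which a stimulus was shown;
# keep rows from 1 second before the earliest to 2 seconds after the latest
trimmed <- df %>% 
  filter(time >= min(time[!is.na(stim)]) - 1,
         time <= max(time[!is.na(stim)]) + 2)

trimmed
```

For the example data this keeps the same nine rows (time 3 to 11) as the row-based version.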
    

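    You could also make your first attempt work with a small modification. Inside mutate(), stim is the column vector, but drop_na() expects a data frame as its first argument, so pass the piped data frame explicitly as `.`:

    library(tidyr)  # provides drop_na()

    df %>% 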
      mutate(first_stim = first(drop_na(., stim) %>% pull(time)),
             last_stim =  last(drop_na(., stim) %>% pull(time))) %>% 
      filter(time >= first(first_stim) - 1 &
               time <= first(last_stim) + 2)
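
As a variation, the stimulus times can be computed once before filtering, which avoids adding helper columns to the data. This is only a sketch under the assumption that the rows are sorted by time; stim_times and trimmed are names introduced here for illustration:

```r
library(dplyr)
library(tidyr)

df <- 
  tibble(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
         time = 0:12)

# times of all rows where a stimulus was shown
stim_times <- df %>% 
  drop_na(stim) %>% 
  pull(time)

# keep rows from 1 second before the first to 2 seconds after the last stimulus
trimmed <- df %>% 
  filter(time >= first(stim_times) - 1,
         time <= last(stim_times) + 2)

trimmed
```

Because stim_times is an ordinary vector, first() and last() return single values and the filter conditions are plain comparisons against the time column.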