r dplyr lag lead

lead or lag function to get several values, not just the nth


I have a tibble with one word per row. I want to create a new variable from a function that searches for a keyword and, when it finds one, builds a string made up of the keyword plus the 3 words before and after it.

The code below is close, but, rather than grabbing all three words before and after my keyword, it grabs the single word 3 ahead/behind.

library(dplyr)

df <- tibble(words = c("it", "was", "the", "best", "of", "times", 
                       "it", "was", "the", "worst", "of", "times"))
# Attempt: pastes only the single words 3 behind and 3 ahead of the keyword
df <- df %>% mutate(chunks = ifelse(words=="times", 
                                    paste(lag(words, 3), 
                                          words, 
                                          lead(words, 3), sep = " "),
                                    NA))

The most intuitive solution would be for lead() and lag() to accept a vector of offsets, e.g. lead(words, 1:3), but that doesn't work.

Obviously I could do this by hand fairly quickly (paste(lead(words,3), lead(words,2), lead(words,1),...lag(words,3))), but eventually I'll want to grab the keyword plus or minus 50 words, which is too much to hand-code.

A tidyverse solution would be ideal, but any help would be appreciated.


Solution

  • One option would be sapply:

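    library(dplyr)

    df %>%
      mutate(
        chunks = ifelse(
          words == "times",
          # For each row i, paste the words from row i - 3 to row i + 3,
          # clamping the window to the first and last rows.
          sapply(
            seq_along(words),
            function(i) paste(words[max(1, i - 3):min(i + 3, length(words))], collapse = " ")
          ),
          NA
        )
      )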
    

    Output:

    # A tibble: 12 x 2
       words chunks                      
       <chr> <chr>                       
     1 it    NA                          
     2 was   NA                          
     3 the   NA                          
     4 best  NA                          
     5 of    NA                          
     6 times the best of times it was the
     7 it    NA                          
     8 was   NA                          
     9 the   NA                          
    10 worst NA                          
    11 of    NA                          
    12 times the worst of times   
    

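    Although sapply is not an explicit lead or lag function, indexing a window around each row can often serve the same purpose. Near the start or end of the data the window is simply truncated, as in row 12 above.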
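
Since the question mentions eventually needing the keyword plus or minus 50 words, the same windowing idea could be wrapped in a small helper with the window size as a parameter. The following is only a sketch: `context_window` and its argument `n` are hypothetical names introduced here, not part of dplyr, and `if_else` with `NA_character_` is swapped in for `ifelse` to keep the column type explicit.

```r
library(dplyr)

# Hypothetical helper (not a dplyr function): for each position i in x,
# paste the elements from i - n to i + n, truncating at both ends of x.
context_window <- function(x, n = 3) {
  vapply(
    seq_along(x),
    function(i) paste(x[max(1, i - n):min(i + n, length(x))], collapse = " "),
    character(1)
  )
}

df <- tibble(words = c("it", "was", "the", "best", "of", "times",
                       "it", "was", "the", "worst", "of", "times"))

# Keep the window only on rows that match the keyword; NA elsewhere.
result <- df %>%
  mutate(chunks = if_else(words == "times",
                          context_window(words, n = 3),
                          NA_character_))
```

Changing `n = 3` to `n = 50` would widen the window without any other edits. Note that the helper builds a window for every row and then discards most of them, which is simple but may be wasteful on very long word lists.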