Tags: r, string, stringr, stringi

Get context around extracted word


I have extracted keywords from a dataframe of sentences. I need to get a few words before and after each keyword to understand the context and to do some basic counts.

I have tried multiple stringr, stringi, and grepl functions that others suggested on SO for similar questions, but I haven't found anything that works for my situation.

Below is what I'd like. Assume it is a dataframe or tibble with the first two fields listed. I need/want to create the rightmost column (keyword_w_context).

In the example, I'm pulling the three words that precede the keyword. But I would want to be able to modify the solution so I can get 1, 2, ... n words. It would also be nice if I could get the words after the keyword in the same way.

Basically, I want to do something like a mutate that creates a new variable with the context words (before/after, see below) around the keyword.

Sentence                            Keyword  Keyword_w_context
The yellow lab dog is so cute.      dog      The yellow lab dog
The fluffy black cat purrs loudly.  cat      The fluffy black cat

Many thanks!


Solution

  • You probably want to take a natural language processing (NLP) approach rather than something based on regular expressions. There are many frameworks for this. An easy enough one is tidytext. Here is an example of how to grab the words surrounding your keywords.

    You will probably want to play around with this to get exactly what you want. It sounds like you want several things out of this, so I just picked one to demonstrate.

    library(tidytext)
    library(dplyr)
    library(tibble)
    
    df <- tibble(Sentence = c("The yellow lab dog is so cute.",
                              "The fluffy black cat purrs loudly."))
    keywords <- tibble(word = c("dog", "cat"), keyword = TRUE)
    
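    df %>% 
      # add a rowid column identifying each sentence
      rowid_to_column() %>% 
      # split each sentence into overlapping 2- and 3-word n-grams
      unnest_tokens("trigram", Sentence, token = "ngrams", n = 3, n_min = 2) %>%
      # split each n-gram into its words, keeping the n-gram column
      unnest_tokens("word", trigram, drop = FALSE) %>% 
      left_join(keywords, by = "word") %>% 
      # keep only rows where the word is a keyword (non-matches are NA and get dropped)
      filter(keyword)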
    
    # A tibble: 10 x 4
       rowid trigram          word  keyword
       <int> <chr>            <chr> <lgl>  
     1     1 yellow lab dog   dog   TRUE   
     2     1 lab dog          dog   TRUE   
     3     1 lab dog is       dog   TRUE   
     4     1 dog is           dog   TRUE   
     5     1 dog is so        dog   TRUE   
     6     2 fluffy black cat cat   TRUE   
     7     2 black cat        cat   TRUE   
     8     2 black cat purrs  cat   TRUE   
     9     2 cat purrs        cat   TRUE   
    10     2 cat purrs loudly cat   TRUE
    

    You can build on this as follows. Here you track which sentence each n-gram came from and the position of each word within its n-gram, so you can filter for cases where the keyword is at word_pos 1, or at any other position.

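    df %>% 
      rowid_to_column("sentence_id") %>% 
      # exactly three words per n-gram this time
      unnest_tokens("trigram", Sentence, token = "ngrams", n = 3, n_min = 3) %>%
      rowid_to_column("trigram_id") %>% 
      unnest_tokens("word", trigram, drop = FALSE) %>% 
      # number the words within each trigram (1, 2, 3)
      group_by(trigram_id) %>% 
      mutate(word_pos = row_number()) %>% 
      left_join(keywords, by = "word") %>%
      relocate(sentence_id, trigram_id, word_pos, trigram, word) %>% 
      # word_pos == 1 keeps trigrams that start with the keyword;
      # word_pos == 3 would keep the ones that end with it instead
      filter(keyword, word_pos == 1)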
    
    # A tibble: 2 x 6
    # Groups:   trigram_id [2]
      sentence_id trigram_id word_pos trigram          word  keyword
            <int>      <int>    <int> <chr>            <chr> <lgl>  
    1           1          4        1 dog is so        dog   TRUE   
    2           2          9        1 cat purrs loudly cat   TRUE
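
    If you want the exact Keyword_w_context column from the question, with a window size you can adjust, a plain base R helper may be enough. The following is only a minimal sketch: keyword_context() and its n_before / n_after arguments are hypothetical names made up for this illustration, not part of tidytext or stringr. It assumes that words are separated by whitespace, that only the first occurrence of the keyword matters, and that the match should be case-insensitive.

```r
# Hypothetical helper (not from any package): return the keyword together with
# up to `n_before` words before it and up to `n_after` words after it.
keyword_context <- function(sentence, keyword, n_before = 3, n_after = 0) {
  # Split on whitespace, then strip leading/trailing punctuation from each token
  words <- strsplit(sentence, "\\s+")[[1]]
  words <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
  words <- words[nzchar(words)]

  # Position of the first case-insensitive exact match of the keyword
  pos <- match(tolower(keyword), tolower(words))
  if (is.na(pos)) return(NA_character_)

  # Clamp the window to the sentence boundaries
  from <- max(1, pos - n_before)
  to   <- min(length(words), pos + n_after)
  paste(words[from:to], collapse = " ")
}

df <- data.frame(
  Sentence = c("The yellow lab dog is so cute.",
               "The fluffy black cat purrs loudly."),
  Keyword  = c("dog", "cat"),
  stringsAsFactors = FALSE
)

# mapply() applies the helper to one sentence/keyword pair at a time
df$Keyword_w_context <- mapply(keyword_context, df$Sentence, df$Keyword,
                               MoreArgs = list(n_before = 3),
                               USE.NAMES = FALSE)
```

    Changing n_before and n_after gives 1, 2, ... n words on either side. For example, keyword_context("The yellow lab dog is so cute.", "dog", n_before = 1, n_after = 2) returns "lab dog is so". The same mapply() call should also work inside dplyr::mutate() if you prefer to stay in a pipeline.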