r, text-mining, stringr, tidytext

Is there a way in R to find a combination of words (or sentences) within a certain range in a string?


I'm trying to find all strings that contain a given combination of words/sentences, allowing other words between them up to a fixed limit.

Example: I want the combination of "bought" and "watch" with at most 2 words separating them.

  • I bought a beautiful and shiny watch -> not ok, because there are 4 words between "bought" and "watch" ("a beautiful and shiny")
  • I bought a shiny watch -> ok, because there are 2 words between "bought" and "watch" ("a shiny")

I haven't found anything in R that comes close to what I want.

To find simple words/sentences in strings, I'm using str_extract_all from stringr, like this:

    library(stringr)
    # Build one alternation pattern, anchored on word boundaries
    my_analysis <- str_c("\\b(", str_c(my_list_of_words_and_sentences, collapse="|"), ")\\b")
    df$words_and_sentences_found <- str_extract_all(df$my_strings, my_analysis)

Solution

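  • You can use skip-grams for this. With k = 2, every pair of words separated by at most 2 other words becomes a token, so the check reduces to an exact match on "bought watch":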

    library(tidyverse)
    library(tidytext)
    
    df <- tibble(id = 1:3,
                 txt = c("I bought a beautiful and shiny watch", 
                         "I bought a shiny watch", 
                         "The watch is very shiny"))
    
    tidy_ngrams <- df %>%
      ## k is the maximum number of words skipped between tokens (all skips from 0 to k
      ## are generated); n_min = n = 2 keeps only two-word grams:
      unnest_tokens(ngram, txt, token = "skip_ngrams", n_min = 2, n = 2, k = 2) 
    
    tidy_ngrams
    #> # A tibble: 33 × 2
    #>       id ngram           
    #>    <int> <chr>           
    #>  1     1 i bought        
    #>  2     1 i a             
    #>  3     1 i beautiful     
    #>  4     1 bought a        
    #>  5     1 bought beautiful
    #>  6     1 bought and      
    #>  7     1 a beautiful     
    #>  8     1 a and           
    #>  9     1 a shiny         
    #> 10     1 beautiful and   
    #> # … with 23 more rows
    
    tidy_ngrams %>%
      filter(ngram == "bought watch")
    #> # A tibble: 1 × 2
    #>      id ngram       
    #>   <int> <chr>       
    #> 1     2 bought watch
    

    Created on 2022-06-03 by the reprex package (v2.0.1)
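
If you would rather stay with the stringr approach from the question, the same "at most 2 words in between" rule can also be written as a regular expression. The following is only a minimal sketch, not part of the original answer: `within_gap_pattern()` and its arguments are hypothetical names introduced here for illustration, and it assumes that "words" are runs of `\w` characters, so a token such as "don't" would count as two words.

```r
library(stringr)

txt <- c("I bought a beautiful and shiny watch",
         "I bought a shiny watch",
         "The watch is very shiny")

# Hypothetical helper (an assumption, not a stringr function):
# builds a pattern for `first`, then at most `max_gap` intervening
# words, then `second`, in that order.
within_gap_pattern <- function(first, second, max_gap = 2) {
  paste0("\\b", first,
         "\\W+(?:\\w+\\W+){0,", as.integer(max_gap), "}",
         second, "\\b")
}

# ignore_case = TRUE mirrors the lower-casing that unnest_tokens() applies
pattern <- regex(within_gap_pattern("bought", "watch"), ignore_case = TRUE)

hits <- str_detect(txt, pattern)
hits
#> [1] FALSE  TRUE FALSE

# The matched span itself, NA where the rule is not satisfied
found <- str_extract(txt, pattern)
```

Like the `filter(ngram == "bought watch")` step, this sketch is order-sensitive: it only finds "bought" followed by "watch". To accept either order, one option is to build a second pattern with the arguments swapped and combine the two `str_detect()` results with `|`.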