I'm trying to find all strings that contain a given combination of words/sentences, where other words may separate them, but only up to a fixed limit.
Example: I want the combination of "bought" and "watch" with at most 2 words separating them.
I haven't found anything close to what I want in R.
To find simple words/sentences in strings I'm using str_extract_all from stringr, like this:
my_analysis <- str_c("\\b(", str_c(my_list_of_words_and_sentences, collapse="|"), ")\\b")
df$words_and_sentences_found <- str_extract_all(df$my_strings, my_analysis)
You can use skip-grams for this:
library(tidyverse)
library(tidytext)
df <- tibble(id = 1:3,
             txt = c("I bought a beautiful and shiny watch",
                     "I bought a shiny watch",
                     "The watch is very shiny"))
tidy_ngrams <- df %>%
  ## k is the maximum number of words skipped between tokens;
  ## n_min and n set the n-gram size (2 here, i.e. pairs of words):
  unnest_tokens(ngram, txt, token = "skip_ngrams", n_min = 2, n = 2, k = 2)
tidy_ngrams
#> # A tibble: 33 × 2
#>       id ngram
#>    <int> <chr>
#>  1     1 i bought
#>  2     1 i a
#>  3     1 i beautiful
#>  4     1 bought a
#>  5     1 bought beautiful
#>  6     1 bought and
#>  7     1 a beautiful
#>  8     1 a and
#>  9     1 a shiny
#> 10     1 beautiful and
#> # … with 23 more rows
tidy_ngrams %>%
  filter(ngram == "bought watch")
#> # A tibble: 1 × 2
#>      id ngram
#>   <int> <chr>
#> 1     2 bought watch
Created on 2022-06-03 by the reprex package (v2.0.1)
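If you would rather stay with stringr, as in the question, a bounded-gap regular expression may also work. The sketch below is only one possible approach, not something taken from tidytext: `near_pattern` is a hypothetical helper name, it assumes the two words contain no regex metacharacters, and it only matches "bought" appearing before "watch", like the skip-gram filter above.

```r
library(stringr)

txt <- c("I bought a beautiful and shiny watch",
         "I bought a shiny watch",
         "The watch is very shiny")

# Hypothetical helper (not part of stringr): builds a pattern matching
# `first`, then at most `max_gap` intervening words, then `second`.
# Assumes both words are plain text without regex metacharacters.
near_pattern <- function(first, second, max_gap = 2) {
  regex(
    paste0("\\b", first, "\\W+(?:\\w+\\W+){0,", max_gap, "}", second, "\\b"),
    ignore_case = TRUE
  )
}

pattern <- near_pattern("bought", "watch", max_gap = 2)

# Only the second string has "bought" and "watch" within 2 words of each other
found <- str_detect(txt, pattern)

# The matched span itself, or NA where there is no match
matched <- str_extract(txt, pattern)
```

Here `found` should be TRUE only for the second string, and `matched` should hold "bought a shiny watch" for it. Raising `max_gap` to 4 would also let the first string match.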