In a text document, I want to count the instances where uncertainty|unclear occurs within 1 to 30 words of global|decrease in demand|fall in demand. However, my code below seems to be insensitive to {1,30}: changing these values doesn't change the output. Any help would be appreciated.
str_count(texttw,"\\buncertainty|unclear(?:\\W+\\w+){1,30} ?\\W+global|decrease in demand|fall in demand\\b")
I am not sure if the typo in your text was on purpose ("uncertainy" instead of "uncertainty") so I corrected it, but try something like this:
library(stringr)
x <- "uncertainty negatively influences economic agents investment and business decisions which leads to decrease in demand. When the economic environment is fraught with uncertainty and the future is unclear businesses and firms may hold back their decisions until uncertainty subsides. Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."
regex <- "(uncertainty|unclear)\\s(\\w+\\s){1,30}(global|decrease in demand|fall in demand)"
str_count(x, regex)
# [1] 2
str_extract_all(x, regex)
# [[1]]
# [1] "uncertainty negatively influences economic agents investment and business decisions which leads to decrease in demand"
# [2] "unclear with unprecedented uncertainty leading to fall in demand"
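The pattern first matches either uncertainty or unclear (the | means "or"), followed by a whitespace character \\s. The group (\\w+\\s) then matches one or more (+) word characters \\w (letters, digits and _) followed by a whitespace character \\s, i.e. one word plus its trailing space. This group should be matched between one and thirty times {1,30}, and it must be followed by one of the target phrases in the last group.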
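As a rough illustration of why the pattern in the question ignores {1,30}: without parentheses, | splits the whole pattern into four separate alternatives, so the quantified gap only belongs to the branch that starts with unclear, and a bare "uncertainty" or "fall in demand" counts on its own. The helpers `ungrouped()` and `grouped()` and the sample sentence `s` below are made up for this sketch; they only parameterise the upper limit of the two patterns discussed here.

```r
library(stringr)

s <- "uncertainty leads to a fall in demand"

# Pattern from the question, with the upper word limit as a parameter
# (hypothetical helper, not part of the original post).
ungrouped <- function(n) paste0(
  "\\buncertainty|unclear(?:\\W+\\w+){1,", n,
  "} ?\\W+global|decrease in demand|fall in demand\\b")

# Pattern from the answer, with the same parameter
# (hypothetical helper, not part of the original post).
grouped <- function(n) paste0(
  "(uncertainty|unclear)\\s(\\w+\\s){1,", n,
  "}(global|decrease in demand|fall in demand)")

# Ungrouped: "uncertainty" and "fall in demand" each match on their own,
# so the limit makes no difference.
str_count(s, ungrouped(30))  # 2
str_count(s, ungrouped(2))   # 2

# Grouped: three words separate the two terms, so a limit of 2 rules it out.
str_count(s, grouped(30))    # 1
str_count(s, grouped(2))     # 0
```

With the alternatives grouped, the count reacts to the limit as expected.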
Technically, all of the capture groups could be made non-capturing groups with ?: since you do not need to back-reference or capture them specifically.
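A minimal sketch of the non-capturing variant, assuming the same pattern as above. The sample sentence `s` and the variable names are invented for illustration. Both versions match the same text; the only visible difference is that str_match() no longer returns a column per group.

```r
library(stringr)

capturing     <- "(uncertainty|unclear)\\s(\\w+\\s){1,30}(global|decrease in demand|fall in demand)"
non_capturing <- "(?:uncertainty|unclear)\\s(?:\\w+\\s){1,30}(?:global|decrease in demand|fall in demand)"

# Hypothetical sample sentence.
s <- "uncertainty leads to a fall in demand"

# Both patterns extract the same text.
same_match <- str_extract(s, capturing) == str_extract(s, non_capturing)
same_match

# str_match() returns the full match plus one column per capture group.
ncol(str_match(s, capturing))      # 4: full match + 3 groups
ncol(str_match(s, non_capturing))  # 1: full match only
```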
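In the text you posted you have an interesting case in the last sentence, "Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."
Depending on your interpretation this could actually have two matches, the second one contained in the first:
"unclear with unprecedented uncertainty leading to fall in demand"
"uncertainty leading to fall in demand"
If this was your interpretation then the text you posted should have three, not two matches. str_count only counts non-overlapping matches, which is why it reports two.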
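If overlapping matches are what you want to count, one possible approach (a sketch, not the only way) is to wrap the pattern in a lookahead. The lookahead itself matches zero characters, so the search can restart at every position, while the capture group inside it still records the text. The names `inner`, `overlapping` and `s` are made up for this example, and `s` is just the last sentence of the sample text.

```r
library(stringr)

inner <- "(?:uncertainty|unclear)\\s(?:\\w+\\s){1,30}(?:global|decrease in demand|fall in demand)"

# Zero-width lookahead with a capture group around the real pattern.
overlapping <- paste0("(?=(", inner, "))")

s <- "Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."

# Plain pattern: the second candidate sits inside the first match and is skipped.
str_count(s, inner)        # 1

# Lookahead: every start position is tried, so both candidates are counted.
str_count(s, overlapping)  # 2

# Column 2 of str_match_all() holds the captured text of each match.
hits <- str_match_all(s, overlapping)[[1]][, 2]
hits
```

Here `hits` contains "unclear with unprecedented uncertainty leading to fall in demand" and "uncertainty leading to fall in demand".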
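Just a note to clarify:
"uncertainty subsides. Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand." is not a match because of the period after "subsides". Each \\w+\\s requires a word to be followed directly by whitespace, so the run of words cannot continue past the period.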
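The sentence-boundary behaviour can be checked with two short strings (both invented for this sketch). The last pattern, `across_punct`, is an assumption about what you might want rather than part of the answer: it swaps \\s for \\W+ so that punctuation between words is allowed, which lets matches span sentences.

```r
library(stringr)

regex <- "(uncertainty|unclear)\\s(\\w+\\s){1,30}(global|decrease in demand|fall in demand)"

# Hypothetical test strings: identical except for the period.
with_period    <- "uncertainty subsides. Ever since the start of the pandemic global"
without_period <- "uncertainty subsides ever since the start of the pandemic global"

str_detect(with_period, regex)     # FALSE: "subsides" is followed by "."
str_detect(without_period, regex)  # TRUE

# Assumed variant: allow any run of non-word characters between words.
across_punct <- "(uncertainty|unclear)\\W+(\\w+\\W+){1,30}(global|decrease in demand|fall in demand)"
str_detect(with_period, across_punct)  # TRUE
```

Whether matches should be allowed to cross a sentence boundary depends on how you define "distance" in your text.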