r regex text nlp stringr

R str_count: counting occurrences of words within a certain distance, e.g. 1 to 30 words apart


In a text document, I want to count the instances where uncertainty|unclear occurs within 1 to 30 words of global|decrease in demand|fall in demand. However, my code below seems to be insensitive to {1,30}: changing these values doesn't change the output. Any help would be appreciated.

str_count(texttw,"\\buncertainty|unclear(?:\\W+\\w+){1,30} ?\\W+global|decrease in demand|fall in demand\\b")

Solution

  • I am not sure if the typo in your text was on purpose ("uncertainy" instead of "uncertainty") so I corrected it, but try something like this:

    library(stringr)
    
    x <- "uncertainty negatively influences economic agents investment and business decisions which leads to decrease in demand. When the economic environment is fraught with uncertainty and the future is unclear businesses and firms may hold back their decisions until uncertainty subsides. Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."
    
    regex <- "(uncertainty|unclear)\\s(\\w+\\s){1,30}(global|decrease in demand|fall in demand)"
    
    str_count(x, regex)
    # [1] 2
    
    str_extract_all(x, regex)
    # [[1]]
    # [1] "uncertainty negatively influences economic agents investment and business decisions which leads to decrease in demand"
    # [2] "unclear with unprecedented uncertainty leading to fall in demand"    
    
    • Begin matching at the word uncertainty OR (|) unclear
    • The word must be followed by a whitespace character \\s
    • The whitespace must be followed by one or more (+) word characters \\w (letters, digits, and underscore) and then another \\s. This word-plus-space unit must match between one and thirty times {1,30}
    • Finally comes the phrase global OR decrease in demand OR fall in demand

    Technically, all of the capture groups could be made non-capturing groups with ?:, since you do not need to back-reference or extract them individually.
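As for why changing {1,30} made no difference in the original pattern: in a regex, | binds more loosely than concatenation and quantifiers, so without parentheses the pattern is read as four independent alternatives (\\buncertainty, unclear...global, decrease in demand, fall in demand\\b) and each keyword is counted on its own. The sketch below compares the two forms on the sample text; ungrouped() and grouped() are hypothetical helper names used only for this illustration, and the grouped version uses the non-capturing groups mentioned above.

```r
library(stringr)

# Same sample text as in the answer above.
x <- "uncertainty negatively influences economic agents investment and business decisions which leads to decrease in demand. When the economic environment is fraught with uncertainty and the future is unclear businesses and firms may hold back their decisions until uncertainty subsides. Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."

# Hypothetical helper: the question's pattern, with a configurable maximum gap.
# Without grouping, "|" splits the whole pattern into separate alternatives.
ungrouped <- function(max_gap) {
  paste0("\\buncertainty|unclear(?:\\W+\\w+){1,", max_gap,
         "} ?\\W+global|decrease in demand|fall in demand\\b")
}

# Hypothetical helper: the grouped pattern, with non-capturing groups (?:...).
grouped <- function(max_gap) {
  paste0("\\b(?:uncertainty|unclear)\\s(?:\\w+\\s){1,", max_gap,
         "}(?:global|decrease in demand|fall in demand)\\b")
}

# Ungrouped: both calls return the same count for this text,
# because the keywords are mostly counted individually.
str_count(x, ungrouped(30))
str_count(x, ungrouped(2))

# Grouped: the count now depends on the allowed distance.
str_count(x, grouped(30))
# [1] 2
str_count(x, grouped(2))
# [1] 1
```

With a maximum gap of 2, only "uncertainty leading to fall in demand" is close enough to qualify.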

    In the text you posted you have an interesting case in the last sentence, "Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."

    Depending on your interpretation this could actually have two matches:

    1. unclear with unprecedented uncertainty leading to fall in demand
    2. uncertainty leading to fall in demand

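    If that is your interpretation, then the text you posted should have three matches, not two. str_count() reports two because regex matches cannot overlap, so the shorter match contained in the longer one is never counted.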
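If you do want overlapping matches counted, one common approach (a sketch, not the only way) is to wrap the whole pattern in a lookahead (?=...). A lookahead consumes no characters, so the engine tests every start position, and a capture group inside it still lets you extract the matched text. The variable name overlap is just illustrative.

```r
library(stringr)

# Same sample text as in the answer above.
x <- "uncertainty negatively influences economic agents investment and business decisions which leads to decrease in demand. When the economic environment is fraught with uncertainty and the future is unclear businesses and firms may hold back their decisions until uncertainty subsides. Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand."

# The answer's pattern wrapped in a lookahead; the outer (...) is capture group 1.
overlap <- "(?=((?:uncertainty|unclear)\\s(?:\\w+\\s){1,30}(?:global|decrease in demand|fall in demand)))"

# Each start position where the lookahead succeeds counts as one match.
n_overlap <- str_count(x, overlap)
n_overlap
# [1] 3

# Column 1 holds the (empty) full match, column 2 holds capture group 1.
spans <- str_match_all(x, overlap)[[1]][, 2]
spans
```

Here spans should contain the two matches from before plus "uncertainty leading to fall in demand".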

    Just a note to clarify:

    "uncertainty subsides. Ever since the start of the pandemic global economic outlook has been unclear with unprecedented uncertainty leading to fall in demand." is not a match starting from its first word, because of the period after "subsides": \\w+\\s requires every word to be followed directly by whitespace.