r regex quanteda

Searching for advanced regex patterns with kwic()


I want to use kwic() to find patterns in text with more advanced regex phrases, but I am struggling with the way kwic() tokenises phrases. Two problems have come up:

1) How to use grouping with phrases that contain whitespace:

kwic(text, pattern = phrase("\\b(address|g[eo]t? into|gotten into)\\b \\bno\\b"), valuetype="regex")

Error in stri_detect_regex(types_search, pattern, case_insensitive = case_insensitive) : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

2) How to look for a longer sequence of words (similar to the first question):

kwic("this is a test", pattern= phrase("(\\w+\\s){1,3}"), valuetype="regex", remove_separator=FALSE)

kwic object with 0 rows

kwic("this is a test", pattern= phrase("(\\w+ ){0,2}"), valuetype="regex", remove_separator=FALSE)

Error in stri_detect_regex(types_search, pattern, case_insensitive = case_insensitive) : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)

Thanks for any tips!


Solution

  • The thing to understand about phrase() is that it lets you write a sequence of patterns, delimited by whitespace, as a single character value. The whitespace only separates the patterns; in normal usage it should not be part of any pattern itself.

    I've chosen a reproducible example for the first part of your question, which I think illustrates the point and answers your question.

    Here, we simply put the different patterns into phrase() with a space between them. This is equivalent to passing a list() whose element is a character vector holding the separate patterns in sequence.

    library("quanteda")
    #> Package version: 2.0.1
    
    kwic("a b c a b d e", pattern = phrase("b c|d"), valuetype = "regex")
    #>                                      
    #>  [text1, 2:3]       a | b c | a b d e
    #>  [text1, 5:6] a b c a | b d | e
    kwic("a b c a b d e", pattern = list(c("b", "c|d")), valuetype = "regex")
    #>                                      
    #>  [text1, 2:3]       a | b c | a b d e
    #>  [text1, 5:6] a b c a | b d | e
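
    This also explains the error in the question. As a rough base R sketch, assuming phrase() simply breaks the string at each space as described above (strsplit() stands in for it here and is not quanteda's actual code), the first pattern falls apart into pieces whose parentheses no longer balance:

    ```r
    # The first pattern from the question, as a single string
    pat <- "\\b(address|g[eo]t? into|gotten into)\\b \\bno\\b"

    # Assumption: phrase() splits at each space; strsplit() imitates that step
    pieces <- strsplit(pat, " ", fixed = TRUE)[[1]]
    print(pieces)

    # Count "(" and ")" in each piece; unequal counts mean the piece
    # is not a valid regex on its own
    n_open  <- lengths(regmatches(pieces, gregexpr("(", pieces, fixed = TRUE)))
    n_close <- lengths(regmatches(pieces, gregexpr(")", pieces, fixed = TRUE)))
    print(data.frame(piece = pieces, n_open = n_open, n_close = n_close))
    ```

    The first piece opens a group that is only closed in the third piece, which is consistent with the U_REGEX_MISMATCHED_PAREN error. The same applies to "(\\w+ ){0,2}", which contains a space inside the group.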
    

    We can also supply a vector of sequence patterns, including very inclusive ones, such as the ".+ ^a$" below, which matches any token of one or more characters followed by the token "a". Notice how ^ and $ anchor the regex to the start and end of the (single) token.

    kwic("a b c a b d e", pattern = phrase(c("b c|d", ".+ ^a$")), valuetype = "regex")
    #>                                      
    #>  [text1, 2:3]       a | b c | a b d e
    #>  [text1, 3:4]     a b | c a | b d e  
    #>  [text1, 5:6] a b c a | b d | e
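
    Applying the same idea to the first pattern in the question, one possible rewrite is sketched below. The sample text is invented for illustration, and splitting the alternatives into two phrases of different lengths is my reading of what the original regex was meant to match, not something confirmed by the question.

    ```r
    library("quanteda")

    # Hypothetical sample text, made up for this illustration
    txt <- "We will address no complaints and I got into no trouble after she had gotten into no fights"
    toks <- tokens(txt)

    # One regex per token, separated by spaces. Alternatives that span a
    # different number of tokens go into separate phrases.
    pats <- phrase(c("^address$ ^no$",
                     "^(g[eo]t?|gotten)$ ^into$ ^no$"))

    res <- kwic(toks, pattern = pats, valuetype = "regex")
    print(res)
    ```

    Each group now sits entirely inside one token-level regex, so no parenthesis is split across pieces.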
    

    For part two, you can use wildcard matching to match anything, which is easiest using the default "glob" match:

    kwic("this is a test", pattern = phrase("* * *"))
    #>                                      
    #>  [text1, 1:3]      | this is a | test
    #>  [text1, 2:4] this | is a test |
    
    kwic("this is a test", pattern = phrase("* *"))
    #>                                         
    #>  [text1, 1:2]         | this is | a test
    #>  [text1, 2:3]    this |  is a   | test  
    #>  [text1, 3:4] this is | a test  |
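
    To approximate a quantifier such as {1,3} over whole tokens, one option is to generate one glob phrase per sequence length and pass them all at once. This is only a sketch; the helper vector pats is a name introduced here for illustration.

    ```r
    library("quanteda")

    toks <- tokens("this is a test")

    # Build "*", "* *" and "* * *" to cover sequences of one to three tokens
    pats <- vapply(1:3,
                   function(n) paste(rep("*", n), collapse = " "),
                   character(1))

    # Default valuetype is "glob", so each * matches any single token
    res <- kwic(toks, pattern = phrase(pats))
    print(res)
    ```

    Because every length is a separate phrase, the result contains overlapping matches of different lengths.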
    

    Note finally that it is possible to include whitespace as part of a pattern match, but only if you have tokens that include whitespace. This would be true if you were to pass through the remove_separators = FALSE argument to the tokens() call via ... (see ?kwic), or if you created tokens in some other way to ensure they contain whitespace.

    as.tokens(list(d1 = c("a b", " ", "c"))) %>%
      kwic(phrase("\\s"), valuetype = "regex")
    #>                        
    #>  [d1, 1]     | a b |  c
    #>  [d1, 2] a b |     | c
    

    There, the "a b" that is displayed is actually the single token "a b", not the sequence of tokens "a", "b". The blank on the second line is the " " token.
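
    As a sketch of the other route mentioned above, the separators can be kept as tokens when tokenising, so that a whitespace regex has something to match. Here tokens() is called directly rather than through the ... of kwic(), and the pattern is an illustration rather than a recommendation.

    ```r
    library("quanteda")

    # Keep the separators, so each space becomes its own token
    toks <- tokens("this is a test", remove_separators = FALSE)
    print(toks)

    # Three token-level regexes: a word, a whitespace token, a word
    res <- kwic(toks, pattern = phrase("^\\w+$ ^\\s$ ^\\w+$"), valuetype = "regex")
    print(res)
    ```

    Note that the spaces inside the phrase() string still only delimit the three patterns; it is the "^\\s$" element that matches the whitespace token.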

    Created on 2020-03-31 by the reprex package (v0.3.0)