I want to use kwic() to find patterns in text with more advanced regex phrases, but I am struggling with the way kwic() tokenises phrases, and I have run into two problems:
1) How to use grouping with phrases that contain whitespace:
kwic(text, pattern = phrase("\\b(address|g[eo]t? into|gotten into)\\b \\bno\\b"), valuetype="regex")
Error in stri_detect_regex(types_search, pattern, case_insensitive = case_insensitive) : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
2) How to look for a longer sequence of words (similar to the first question):
kwic("this is a test", pattern= phrase("(\\w+\\s){1,3}"), valuetype="regex", remove_separator=FALSE)
kwic object with 0 rows
kwic("this is a test", pattern= phrase("(\\w+ ){0,2}"), valuetype="regex", remove_separator=FALSE)
Error in stri_detect_regex(types_search, pattern, case_insensitive = case_insensitive) : Incorrectly nested parentheses in regexp pattern. (U_REGEX_MISMATCHED_PAREN)
Thanks for any tips!
The thing to understand with phrase() is that it lets you write a sequence of patterns, delimited by whitespace, as a single character value. In normal usage, the whitespace delimiters should not themselves be part of any pattern.
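To see why the first pattern in the question fails, it can help to look at the pieces that each per-token regex would be built from. The following is only a base R approximation of the whitespace split that phrase() performs: strsplit() stands in for quanteda's internals, and count_char is a helper made up for this sketch.

```r
# Approximation only: split the pattern on spaces, roughly as phrase() does,
# to see the per-token regexes that kwic() would end up with.
pattern <- "\\b(address|g[eo]t? into|gotten into)\\b \\bno\\b"
pieces <- strsplit(pattern, " ", fixed = TRUE)[[1]]
print(pieces)

# Hypothetical helper: count occurrences of a literal character in each piece.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

# Each piece is compiled as its own regex, so parentheses must balance per piece.
balanced <- count_char(pieces, "(") == count_char(pieces, ")")
print(balanced)
```

The first and third pieces each contain only one half of the group, which is consistent with the U_REGEX_MISMATCHED_PAREN error reported in the question.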
I've chosen a reproducible example for the first part of your question, which I think illustrates the point and answers your question.
Here, we simply put the different patterns into phrase() with a space between them. This is equivalent to wrapping them in a list() whose element is a character vector holding one pattern per token.
library("quanteda")
#> Package version: 2.0.1
kwic("a b c a b d e", pattern = phrase("b c|d"), valuetype = "regex")
#>
#> [text1, 2:3] a | b c | a b d e
#> [text1, 5:6] a b c a | b d | e
kwic("a b c a b d e", pattern = list(c("b", "c|d")), valuetype = "regex")
#>
#> [text1, 2:3] a | b c | a b d e
#> [text1, 5:6] a b c a | b d | e
We could also supply a vector of phrases, including very inclusive ones, such as the ".+ ^a$" below, which matches any token of one or more characters followed by the token "a". Notice how the ^ and $ make it explicit that this is the start and end of the (single-token) regex.
kwic("a b c a b d e", pattern = phrase(c("b c|d", ".+ ^a$")), valuetype = "regex")
#>
#> [text1, 2:3] a | b c | a b d e
#> [text1, 3:4] a b | c a | b d e
#> [text1, 5:6] a b c a | b d | e
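Applied to the pattern from the question, one possible rewrite gives every token its own anchored regex and splits the alternation into two phrases, because the alternatives span different numbers of tokens. This is a sketch rather than the only way to do it: the sample text is made up for illustration, and the text is tokenised explicitly with tokens() first, on the assumption that this works across quanteda versions.

```r
library("quanteda")

# Hypothetical sample text, not taken from the question.
toks <- tokens("They did not address no issues and got into no trouble.")

# One regex per token, separated by spaces; ^ and $ anchor each single-token regex.
# Two phrases are needed because the alternatives are two and three tokens long.
pats <- phrase(c("^address$ ^no$", "^(g[eo]t?|gotten)$ ^into$ ^no$"))

res <- kwic(toks, pattern = pats, valuetype = "regex")
print(res)
```

Keeping the group inside a single token's regex means the parentheses always balance within each piece.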
For part two, you can use wildcard matching to match anything, which is easiest using the default "glob" match:
kwic("this is a test", pattern = phrase("* * *"))
#>
#> [text1, 1:3] | this is a | test
#> [text1, 2:4] this | is a test |
kwic("this is a test", pattern = phrase("* *"))
#>
#> [text1, 1:2] | this is | a test
#> [text1, 2:3] this | is a | test
#> [text1, 3:4] this is | a test |
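A quantifier such as {1,3} cannot span tokens, but a similar effect can be sketched by supplying one glob phrase per sequence length. The example below assumes the text is tokenised with tokens() first and uses the same test sentence as the question.

```r
library("quanteda")

toks <- tokens("this is a test")

# One phrase per sequence length: 1, 2 and 3 consecutive tokens.
pats <- phrase(c("*", "* *", "* * *"))

# Default valuetype is "glob", so * matches any single token.
res <- kwic(toks, pattern = pats)
print(res)
```

Each phrase is matched independently, so overlapping sequences of different lengths are all reported.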
Note finally that it is possible to include whitespace as part of a pattern match, but only if you have tokens that include whitespace. This would be true if you were to pass the remove_separators = FALSE argument through to the tokens() call via ... (see ?kwic), or if you created tokens in some other way to ensure they contain whitespace.
as.tokens(list(d1 = c("a b", " ", "c"))) %>%
kwic(phrase("\\s"), valuetype = "regex")
#>
#> [d1, 1] | a b | c
#> [d1, 2] a b | | c
There, the "a b" that is displayed is actually the single token "a b", not the sequence of tokens "a", "b". The blank on the second line is the " " token.
Created on 2020-03-31 by the reprex package (v0.3.0)