I'm trying to use a regex pattern with kwic that doesn't match a word preceded by "in", "of" or "and" (using a negative lookbehind). It works in regex101 but not in kwic (which uses stringi's ICU regex engine).
A reduced example is here:
library(quanteda)
tmp_tokens <- structure(list(text1 =
c(6L, 1L, 4L, 2L, 6L, 6L, 6L, 3L, 4L, 2L, 6L, 6L, 6L, 7L, 4L, 6L, 6L, 6L, 7L
)), class = "tokens", types = c("and", "or", "if", "history", "geography", "waffle", "in"
), padding = TRUE, docvars = structure(list(docname_ = "text1",
docid_ = structure(1L, levels = "text1", class = "factor"),
segid_ = 1L, fileid = 758971L), row.names = c(NA, -1L), class = "data.frame"), meta = list(
system = list(`package-version` = structure(list(c(4L, 1L,
0L)), class = c("package_version", "numeric_version")), `r-version` = structure(list(
c(4L, 4L, 1L)), class = c("R_system_version", "package_version",
"numeric_version")), system = c(sysname = "Windows", machine = "x86-64",
user = ".."), directory = "../.",
created = structure(20116, class = "Date")), object = list(
unit = "documents", what = "word", tokenizer = "tokenize_word4",
ngram = 1L, skip = 0L, concatenator = "_", summary = list(
hash = character(0), data = NULL)), user = list()))
kwic_humanities <- kwic(tmp_tokens,
pattern = phrase("(?<!in\\s)history"),
valuetype = "regex",
separator = " ")
I'd expect this to filter out the final entry:
Keyword-in-context with 3 matches.
[text1, 3] waffle and | history | or waffle waffle waffle if
[text1, 9] or waffle waffle waffle if | history | or waffle waffle waffle in
[text1, 15] or waffle waffle waffle in | history | waffle waffle waffle in
Regex patterns are matched against each token separately, so a lookahead or lookbehind cannot see the neighbouring tokens. If you want to exclude some patterns, remove them before calling kwic().
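To see why the lookbehind never fires, it may help to compare matching against one long string with matching against separate tokens. This is only a sketch of the idea using stringi directly, not quanteda's internal code; the example strings and variable names are made up for illustration.

```r
library(stringi)

pattern <- "(?<!in\\s)history"

# Whole string: the lookbehind can see the "in " that precedes "history",
# so the match is rejected.
whole_in <- stri_detect_regex("waffle in history", pattern)

# Whole string with a different preceding word: the match is accepted.
whole_and <- stri_detect_regex("waffle and history", pattern)

# Token by token: each element is tested on its own, so when "history"
# is tested there is nothing before it for the lookbehind to inspect.
toks <- c("waffle", "in", "history")
per_token <- stri_detect_regex(toks, pattern)

print(whole_in)   # FALSE
print(whole_and)  # TRUE
print(per_token)  # FALSE FALSE TRUE
```

In the per-token case "history" always matches, which is consistent with kwic returning all three entries in the question.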
> kwic(tmp_tokens, pattern = "history")
Keyword-in-context with 3 matches.
[text1, 3] waffle and | history | or waffle waffle waffle if
[text1, 9] or waffle waffle waffle if | history | or waffle waffle waffle in
[text1, 15] or waffle waffle waffle in | history | waffle waffle waffle in
>
> tokens_remove(tmp_tokens, phrase("in history"), padding = TRUE) %>%
+ kwic(pattern = "history")
Keyword-in-context with 2 matches.
[text1, 3] waffle and | history | or waffle waffle waffle if
[text1, 9] or waffle waffle waffle if | history | or waffle waffle waffle
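The question also mentions "of" and "and" as preceding words to exclude. The same approach can probably be extended by passing several two-word phrases at once. The following is a minimal sketch on a small made-up sentence (not the tmp_tokens object from the question); the names toks, excluded and to_drop are hypothetical.

```r
library(quanteda)

# Made-up example text with "history" preceded by and, if, in and of.
toks <- tokens("waffle and history or waffle if history or waffle in history waffle of history")

# One two-word phrase per preceding word that should be excluded.
excluded <- c("in", "of", "and")
to_drop <- phrase(paste(excluded, "history"))

# Remove those sequences first, keeping pads so token positions are preserved,
# then run kwic() on what is left.
toks_clean <- tokens_remove(toks, to_drop, padding = TRUE)
kw <- kwic(toks_clean, pattern = "history")
print(kw)
```

Only the occurrence preceded by "if" should remain. Because padding = TRUE replaces the removed tokens with empty pads, the context windows of the remaining matches can contain gaps, as in the last line of the output above.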