r regex icu quanteda

R quanteda kwic not matching negative look behind pattern


I'm trying to use a regex pattern with kwic() that skips a word when it is preceded by "in", "of" or "and" (using a negative lookbehind). The pattern works on regex101 but not in kwic(), which uses stringi's ICU regex engine.

A reduced example is here:

library(quanteda)
tmp_tokens <- structure(list(text1 = 
c(6L, 1L, 4L, 2L, 6L, 6L, 6L, 3L, 4L, 2L, 6L, 6L, 6L, 7L, 4L, 6L, 6L, 6L, 7L
)), class = "tokens", types = c("and", "or", "if", "history", "geography", "waffle", "in"
), padding = TRUE, docvars = structure(list(docname_ = "text1", 
    docid_ = structure(1L, levels = "text1", class = "factor"), 
    segid_ = 1L, fileid = 758971L), row.names = c(NA, -1L), class = "data.frame"), meta = list(
    system = list(`package-version` = structure(list(c(4L, 1L, 
    0L)), class = c("package_version", "numeric_version")), `r-version` = structure(list(
        c(4L, 4L, 1L)), class = c("R_system_version", "package_version", 
    "numeric_version")), system = c(sysname = "Windows", machine = "x86-64", 
    user = ".."), directory = "../.", 
        created = structure(20116, class = "Date")), object = list(
        unit = "documents", what = "word", tokenizer = "tokenize_word4", 
        ngram = 1L, skip = 0L, concatenator = "_", summary = list(
            hash = character(0), data = NULL)), user = list()))

  kwic_humanities <- kwic(tmp_tokens, 
                          pattern = phrase("(?<!in\\s)history"), 
                          valuetype = "regex",
                          separator = " ")

I'd expect this to filter out the final entry, but all three matches are returned:

Keyword-in-context with 3 matches.                                                                              
  [text1, 3]                 waffle and | history | or waffle waffle waffle if
  [text1, 9] or waffle waffle waffle if | history | or waffle waffle waffle in
  [text1, 15] or waffle waffle waffle in | history | waffle waffle waffle in  

Solution

  • Regex patterns are matched against one token at a time, so a lookahead or lookbehind cannot see the neighbouring tokens. If you want to exclude matches that follow certain words, remove those sequences before calling kwic().

    > kwic(tmp_tokens, pattern = "history")
    Keyword-in-context with 3 matches.                                                                              
      [text1, 3]                 waffle and | history | or waffle waffle waffle if
      [text1, 9] or waffle waffle waffle if | history | or waffle waffle waffle in
     [text1, 15] or waffle waffle waffle in | history | waffle waffle waffle in   
    
    > tokens_remove(tmp_tokens, phrase("in history"), padding = TRUE) %>% 
    + kwic(pattern = "history")
    Keyword-in-context with 2 matches.                                                                             
     [text1, 3]                 waffle and | history | or waffle waffle waffle if
     [text1, 9] or waffle waffle waffle if | history | or waffle waffle waffle
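
To see why the lookbehind never fires, here is a minimal sketch in base R that mimics per-token matching. It is only an illustration: the token vector is written out by hand from the reduced example, the names toks, hits, prev and keep are made up for this sketch, and grepl(perl = TRUE) uses PCRE rather than the ICU engine that quanteda calls through stringi. The point it shows is the same, though: when each token is tested on its own, there is nothing before "history" for the lookbehind to inspect.

```r
# Tokens of the reduced example, written out as a plain character vector
# (hypothetical stand-in for the quanteda tokens object in the question).
toks <- c("waffle", "and", "history", "or", "waffle", "waffle", "waffle",
          "if", "history", "or", "waffle", "waffle", "waffle", "in",
          "history", "waffle", "waffle", "waffle", "in")

# Test the pattern against each token separately. Inside the single token
# "history" nothing precedes the match, so the negative lookbehind always passes.
hits <- grepl("(?<!in\\s)history", toks, perl = TRUE)
which(hits)
# [1]  3  9 15

# To take context into account, compare each token with the one before it.
prev <- c("", head(toks, -1))           # previous token; "" for the first one
keep <- toks == "history" & prev != "in"
which(keep)
# [1] 3 9
```

Positions 3 and 9 are the two matches that survive in the tokens_remove() output above. To exclude several preceding words, the last comparison could become `!(prev %in% c("in", "of", "and"))`.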