r, regex, quanteda

kwic in quanteda (R) does not identify more than one word in regex pattern


I am trying to identify regex patterns in text, but kwic() does not identify regex phrases that are longer than just one word. I tried to use phrase(), but that did not work either.

To give you an example:

    mycorpus <- corpus(bla$`TEXT`)
    foo <- kwic(mycorpus, pattern = "\\bno\\b", window = 10, valuetype = "regex")  # gives 1959 obs.
    foo <- kwic(mycorpus, pattern = "\\bno\\b\\s{0,5}\\w+", window = 10, valuetype = "regex")  # gives 0 obs.
    foo <- kwic(mycorpus, pattern = "no\\sother", window = 10, valuetype = "regex")  # gives 0 obs., even though it should find 3 phrases

The last two calls return no matches, even though the text contains several phrases that should be identified.

Thanks for the help!


Solution

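  • That's because kwic() searches tokens, and once the text has been tokenized the tokens no longer contain spaces, so a pattern such as "no\\sother" can never match within a single token. To search for a sequence of tokens, which quanteda treats as a "phrase", wrap the pattern in phrase(): each whitespace-separated element of the pattern is then matched against one token in the sequence. (See also ?phrase.)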

    library("quanteda")
    ## Package version: 2.0.0
    
    txt <- "one two three four five"
    
    # no match
    kwic(txt, "one\\stwo", valuetype = "regex", window = 1)
    ## kwic object with 0 rows
    
    # match
    kwic(txt, phrase("one two"), valuetype = "regex", window = 1)
    ##                                 
    ##  [text1, 1:2]  | one two | three
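
Applied to the patterns in the question, the same idea can be sketched as follows. This is a minimal illustration, not the asker's data: the two sentences, the document names d1 and d2, and the object names toks, kw_exact and kw_any are made up for the example. It assumes that each element of phrase() is treated as a regular expression for one token, and that regex matching is partial unless anchored, which is why ^ and $ are used to keep "no" from matching inside "know".

```r
library("quanteda")

# Hypothetical example text; the names d1 and d2 are illustrative only
txt <- c(d1 = "There is no other way.",
         d2 = "I know another way.")

# Tokenize explicitly so that kwic() receives a tokens object
toks <- tokens(txt)

# "no" followed by "other": one regex per token, anchored so that
# "know another" in d2 is not picked up by partial matching
kw_exact <- kwic(toks, phrase("^no$ ^other$"), valuetype = "regex", window = 2)

# "no" followed by any token that contains a word character,
# a token-level counterpart of "\\bno\\b\\s{0,5}\\w+"
kw_any <- kwic(toks, phrase("^no$ \\w+"), valuetype = "regex", window = 2)

print(kw_exact)
print(kw_any)
```

Tokenizing first with tokens() is a design choice here rather than something the answer above requires: it makes explicit that the patterns are matched token by token rather than against the raw string.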