I am trying to identify regex patterns in text, but kwic() does not identify regex phrases that are longer than just one word. I tried to use phrase(), but that did not work either.
To give you an example:
mycorpus = corpus(bla$`TEXT` )
foo = kwic(mycorpus, pattern = "\\bno\\b", window = 10, valuetype = "regex" ) #gives 1959 obs.
foo = kwic(mycorpus, pattern = "\\bno\\b\\s{0,5}\\w+", window = 10, valuetype = "regex" ) #gives 0 obs.
foo = kwic(mycorpus, pattern = "no\\sother", window = 10, valuetype = "regex" ) #gives 0 obs. even though it should find 3 phrases
The last two calls return nothing, even though the text contains multiple phrases that they should identify.
Thanks for the help!
That's because kwic() searches tokens, and tokens no longer contain spaces. To search for a sequence of tokens, what quanteda treats as a "phrase", wrap the pattern in phrase(). (See also ?phrase.)
library("quanteda")
## Package version: 2.0.0
txt <- "one two three four five"
# no match
kwic(txt, "one\\stwo", valuetype = "regex", window = 1)
## kwic object with 0 rows
# match
kwic(txt, phrase("one two"), valuetype = "regex", window = 1)
##
## [text1, 1:2] | one two | three
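
Applied to the patterns in the question, the same idea might look like the sketch below. The sample text is a made-up stand-in for the asker's corpus. It assumes that each space-separated element inside phrase() is treated as its own regex for one token, so anchoring with ^ and $ keeps "no" from also matching tokens such as "know". Tokenizing first with tokens() is a choice made here because newer quanteda versions expect a tokens object rather than raw text.

```r
library("quanteda")

# Hypothetical sample text standing in for the corpus in the question
txt <- c(d1 = "There is no other way. I know no one here.")
toks <- tokens(txt)

# "no" immediately followed by "other": one anchored regex per token
kw_other <- kwic(toks, phrase("^no$ ^other$"), valuetype = "regex", window = 3)

# "no" followed by any token that contains a word character
kw_any <- kwic(toks, phrase("^no$ \\w"), valuetype = "regex", window = 3)

# number of matches found by each pattern
nrow(kw_other)
nrow(kw_any)
```

Whitespace quantifiers such as \\s{0,5} have no equivalent here, since the whitespace is already gone after tokenization; the space inside phrase() only separates the per-token patterns.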