Further edit to original question.
The question arose from my expectation that regexes would work identically, or nearly so, to "grep" or to some programming language. Below is what I expected; the fact that it did not happen prompted my question (using Cygwin):
echo "regex unusual operation will deport into a different" > out.txt
grep "will * dep" out.txt
regex unusual operation will deport into a different
kwic(immigCorpus, "deport", window = 3)
Its output is:
[BNP, 157] The BNP will | deport | all foreigners convicted
[BNP, 1946] . 2. | Deport | all illegal immigrants
[BNP, 1952] immigrants We shall | deport | all illegal immigrants
[BNP, 2585] Criminals We shall | deport | all criminal entrants
To learn the basics, I ran
kwic(immigCorpus, "will *depo", window = 3, valuetype = "regex")
expecting to get
[BNP, 157] The BNP will | deport | all foreigners convicted
but I get:
kwic object with 0 rows
Similar attempts, such as
kwic(immigCorpus, ".*will *depo.*", window = 3, valuetype = "regex")
give the same result:
kwic object with 0 rows
Why is that? Is it because of tokenization? If so, how should I write the regex?
PS: Thanks for this great package.
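You are trying to match a phrase with your pattern. By default, the pattern argument is treated as a space-separated list of keywords, and each keyword is matched against individual tokens rather than against the running text, so a regex that spans a space cannot match. To match a multi-word sequence, wrap the pattern in phrase(). So, you may get your expected result using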
> kwic(immigCorpus, phrase("will deport"), window = 3)
[BNP, 156:157] - The BNP | will deport | all foreigners convicted
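To see why the original regex returns nothing, here is a minimal base R sketch of the difference between grep-style and token-wise matching. It does not call quanteda; splitting on single spaces with strsplit() is only a rough stand-in for the package's tokenizer, and the variable names txt and toks are illustrative.

```r
# Base R only: strsplit() on spaces is an assumed, simplified stand-in
# for quanteda's tokenizer.
txt  <- "The BNP will deport all foreigners convicted"
toks <- strsplit(txt, " ", fixed = TRUE)[[1]]

# grep-style: the regex sees the whole line, spaces included
whole_line_hit <- grepl("will *depo", txt)

# kwic-style: the regex is tested against each token separately,
# and no single token contains both "will" and "depo"
per_token_hit <- any(grepl("will *depo", toks))

# A pattern that fits inside one token does match
single_token_match <- toks[grepl("depo", toks)]

print(whole_line_hit)      # TRUE
print(per_token_hit)       # FALSE
print(single_token_match)  # "deport"
```

The same reasoning applies to ".*will *depo.*": however permissive the pattern, it is only ever compared with one token at a time.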
Setting valuetype = "regex" makes sense only if you are actually using a regex. E.g., to get both "shall deport" and "will deport", use
> kwic(immigCorpus, phrase("(will|shall) deport"), window = 3, valuetype = "regex")
[BNP, 156:157] - The BNP | will deport | all foreigners convicted
[BNP, 1951:1952] illegal immigrants We | shall deport | all illegal immigrants
[BNP, 2584:2585] Foreign Criminals We | shall deport | all criminal entrants
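Conceptually, a phrase pattern can be thought of as one regex per token position, matched against consecutive tokens. The base R sketch below illustrates that idea only; it is an assumption about the general mechanism, not quanteda's actual implementation, and it ignores details such as the package's default case-insensitive matching. The names toks, pats, first, second and hits are illustrative.

```r
# Illustrative token vector (hand-split; not produced by quanteda)
toks <- c("We", "shall", "deport", "all", "illegal", "immigrants")

# One regex per position in the two-word phrase
pats <- c("(will|shall)", "deport")

n <- length(toks)
first  <- grepl(pats[1], toks[-n])  # does token i match the first regex?
second <- grepl(pats[2], toks[-1])  # does token i + 1 match the second?

# Start positions where both consecutive tokens match
hits <- which(first & second)
matched <- paste(toks[hits], toks[hits + 1])

print(hits)     # 2
print(matched)  # "shall deport"
```

Seen this way, the space in phrase("(will|shall) deport") separates two per-token regexes and is not itself part of any regex.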