I have some text with phrases containing numbers that are followed by various symbols, and I want to extract them, for example numbers followed by a percentage sign. Using the kwic function from the quanteda package seems to work for numbers given as regular expressions ("\\d{1,}", for example).
However, I can't find out how to extract a number followed by a percentage sign using quanteda.
The following text might serve as an example:
Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU. Eight (32%) developed diarrhoea attributable only to C. difficile and/ or toxin, and the remaining 17 (68%) were asymptomat- ic: none had pseudomembranous colitis.
The reason is that when you call kwic() on a corpus or character object directly, it passes some arguments to tokens() that affect how the tokenization occurs, prior to the keywords-in-context analysis. (This is documented in the ... parameter in ?kwic.)
The default tokenisation in quanteda uses the stringi word boundary definitions, so that:
tokens("Thirteen (7%) of 187")
# tokens from 1 document.
# text1 :
# [1] "Thirteen" "(" "7" "%" ")" "of" "187"
If you want to use a simpler, whitespace tokeniser, this can be accomplished using:
tokens("Thirteen (7%) of 187", what = "fasterword")
# tokens from 1 document.
# text1 :
# [1] "Thirteen" "(7%)" "of" "187"
So, the way to use this as you want in kwic(), with s holding the example text from the question, would be:
kwic(s, "\\d+%", valuetype = "regex", what = "fasterword")
# [text1, 2] Thirteen | (7%) | of 187 patients acquired C.
# [text1, 12] C. difficile in ICU-1, 9 | (36%) | of 25 on ICU-2 and
# [text1, 19] 25 on ICU-2 and 3 | (5.9%) | of 51 patients in BU.
# [text1, 26] 51 patients in BU. Eight | (32%) | developed diarrhoea attributable only to
# [text1, 41] toxin, and the remaining 17 | (68%) | were asymptomat- ic: none had
Otherwise, with the default tokenisation, the number and the % sign are separate tokens, so you need to wrap the regex in phrase() and separate the elements by whitespace:
kwic(s, phrase("\\d+ %"), valuetype = "regex")
# [text1, 3:4] Thirteen( | 7 % | ) of 187 patients acquired
# [text1, 18:19] in ICU-1, 9( | 36 % | ) of 25 on ICU-2
# [text1, 28:29] on ICU-2 and 3( | 5.9 % | ) of 51 patients in
# [text1, 39:40] in BU. Eight( | 32 % | ) developed diarrhoea attributable only
# [text1, 60:61] and the remaining 17( | 68 % | ) were asymptomat- ic
This behaviour might take a bit of getting used to, but it's the best way of ensuring complete user control over searching for multi-token sequences, rather than implementing a single way of determining what should be the elements of a multi-token sequence when the inputs have yet to be tokenised.
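Both approaches can also be written with the tokenisation step made explicit, which may make it easier to see which tokeniser the pattern is being matched against. The following is only a sketch: the object names (s, toks_ws, toks_default, kw_ws, kw_phrase) are illustrative, the text is a shortened version of the example from the question, and it assumes a quanteda version in which kwic() accepts a tokens object. Newer releases appear to prefer this form over calling kwic() on a character or corpus object directly.

```r
library(quanteda)

# Shortened version of the example text from the question (illustrative)
s <- "Thirteen (7%) of 187 patients acquired C. difficile in ICU-1, 9 (36%) of 25 on ICU-2 and 3 (5.9%) of 51 patients in BU."

# Approach 1: whitespace tokenisation keeps "(7%)" as a single token,
# so a single-token regex is enough
toks_ws <- tokens(s, what = "fasterword")
kw_ws <- kwic(toks_ws, pattern = "\\d+%", valuetype = "regex")
print(kw_ws)

# Approach 2: default tokenisation splits the number and "%" into
# separate tokens, so the pattern must be a two-element phrase()
toks_default <- tokens(s)
kw_phrase <- kwic(toks_default, pattern = phrase("\\d+ %"), valuetype = "regex")
print(kw_phrase)
```

Inspecting toks_ws and toks_default before calling kwic() is a quick way to check which of the two patterns will match.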