I have two interrelated questions with respect to pattern matching in R using the package {quanteda} and the tokens_lookup() function with the default valuetype = "glob" (see here and here).
Say I wanted to match a German word which is spelt slightly differently depending on whether it is singular or plural: "Apfel" (EN: apple), "Äpfel" (EN: apples). The plural thus starts with the umlaut "ä" instead of "a". So when I look up tokens, I want to make sure that whether or not I find fruits in a text does not depend on whether the word I'm looking for is singular or plural. This is a very simple example, and I'm aware that I might as well build a dictionary that features "äpfel*" and "apfel*", but my question is more generally about the use of special characters like square brackets.
So in essence, I thought I could simply go with square brackets, similar to regex pattern matching: [aä]. More generally, I thought I could use things like [a-z] to match any single letter from a to z, or [0-9] to match any single digit from 0 to 9. In fact, that's what it says here. For some reason, none of that seems to work:
library(quanteda)
text <- c(d1 = "i like apples and apple pie",
d2 = "ich mag äpfel und apfelkuchen")
dict_1 <- dictionary(list(fruits = c("[aä]pfel*"))) # EITHER "a" OR "ä"
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*"))) # ANY LETTER
tokens(text) %>%
tokens_lookup(dict_1, valuetype = "glob")
tokens(text) %>%
tokens_lookup(dict_2, valuetype = "glob")
1.) Is there a way to use square brackets at all in glob pattern matching?
2.) If so, would [a-z] also match umlauts (ä, ö, ü)? If not, how can we match characters like that?
1) No, you cannot use brackets with glob pattern matching. However, they work perfectly with regex pattern matching.
2) No, [a-z] will not match umlauts.
Here's how to do it, with everything stripped from your example that is not needed to answer the question.
library("quanteda")
## Package version: 2.0.1
text <- "Ich mag Äpfel und Apfelkuchen"
toks <- tokens(text)
dict_1 <- dictionary(list(fruits = c("[aä]pfel*")))
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*")))
tokens_lookup(toks, dict_1, valuetype = "regex", exclusive = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "Ich" "mag" "FRUITS" "und" "FRUITS"
tokens_lookup(toks, dict_2, valuetype = "regex", exclusive = FALSE)
## Tokens consisting of 1 document.
## text1 :
## [1] "Ich" "mag" "Äpfel" "und" "FRUITS"
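To see why "Äpfel" is left untouched by dict_2, it can help to test the character classes on their own. The following is a minimal base R sketch, not quanteda code: the vector toks_chr is a made-up name for illustration, and the patterns are anchored with ^ for clarity. As far as I know, quanteda hands regex patterns to stringi (ICU), which also understands \p{L}, but treat that as an assumption and check it against your installed version.

```r
# Base R only. "\u00e4" is the escape for "ä", so the example does not
# depend on how this file is encoded.
toks_chr <- c("apfel", "\u00e4pfel", "apfelkuchen")

# ASCII range: covers a to z only, so the umlaut form is missed.
ascii_hits <- grepl("^[a-z]pfel", toks_chr, perl = TRUE)

# Explicit alternatives, as in dict_1: "a" or "ä".
explicit_hits <- grepl("^[a\u00e4]pfel", toks_chr, perl = TRUE)

# Unicode property class: any letter, umlauts included.
unicode_hits <- grepl("^\\p{L}pfel", toks_chr, perl = TRUE)

print(ascii_hits)
print(explicit_hits)
print(unicode_hits)
```

If the \p{L} class behaves the same way inside a quanteda dictionary, a pattern such as "^\\p{L}pfel" with valuetype = "regex" would be one way to catch umlauts without listing them one by one.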
Note: No need to import all of the tidyverse just to get %>%, as quanteda makes this available through re-export.
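One more thing to keep in mind when switching from glob to regex: the same pattern string no longer means the same thing. In a regex, * repeats the preceding character (here the "l") rather than standing for "anything", and nothing ties the match to the start of the token. Here is a small base R sketch of the difference; the token vector is invented for illustration ("zapfen" is just an unrelated word that happens to contain "apfe"), and grepl() stands in for whatever matching quanteda does internally.

```r
# Base R only; "\u00e4" is the escape for "ä".
toks_chr <- c("apfel", "\u00e4pfel", "apfelkuchen", "zapfen")

# Glob-style string used as a regex: "l*" means zero or more "l",
# and the match may start anywhere inside the token.
loose <- grepl("[a\u00e4]pfel*", toks_chr, perl = TRUE)

# Anchored regex, closer in spirit to the glob "apfel*":
# the token must begin with "apfel" or "äpfel".
strict <- grepl("^[a\u00e4]pfel", toks_chr, perl = TRUE)

print(setNames(loose, toks_chr))
print(setNames(strict, toks_chr))
```

The unanchored pattern also flags "zapfen", while the anchored one keeps only the three apple tokens, so adding ^ is probably what you want in a dictionary meant to replace "apfel*" and "äpfel*".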