Tags: r, dictionary, pattern-matching, glob, quanteda

Quanteda: How can I use square brackets with glob-style pattern matching using tokens_lookup?


I have two interrelated questions with respect to pattern matching in R using the package {quanteda} and the tokens_lookup() function with the default valuetype="glob" (see here and here).

Say I want to match a German word that is spelt slightly differently depending on whether it is singular or plural: "Apfel" (EN: apple) and "Äpfel" (EN: apples). The plural starts with the umlaut "ä" instead of "a". So when I look up tokens, whether or not I find fruits in a text should not depend on whether the word I'm looking for is singular or plural. This is a very simple example, and I'm aware that I could just as well build a dictionary that features "äpfel*" and "apfel*", but my question is more generally about the use of special characters like square brackets.

So in essence, I thought I could simply use square brackets as in regex pattern matching: [aä]. More generally, I thought I could use things like [a-z] to match any single letter from a to z or [0-9] to match any single digit from 0 to 9. In fact, that's what it says here. For some reason, none of that seems to work:

library(quanteda)

text <- c(d1 = "i like apples and apple pie", 
          d2 = "ich mag äpfel und apfelkuchen")

dict_1 <- dictionary(list(fruits = c("[aä]pfel*")))      # EITHER "a" OR "ä"
dict_2 <- dictionary(list(fruits = c("[a-z]pfel*")))     # ANY LETTER

tokens(text) %>%
  tokens_lookup(dict_1, valuetype = "glob")

tokens(text) %>%
  tokens_lookup(dict_2, valuetype = "glob")

1.) Is there a way to use square brackets at all in glob pattern matching?

2.) If so, would [a-z] also match umlauts (ä,ö,ü) or if not, how can we match characters like that?


Solution

  • 1) No, glob pattern matching does not support square brackets. They do work with regex pattern matching (valuetype = "regex").

    2) No, [a-z] will not match umlauts.

    Here's how to do it, with your example reduced to what is needed to answer the question.

    library("quanteda")
    ## Package version: 2.0.1
    
    text <- "Ich mag Äpfel und Apfelkuchen"
    
    toks <- tokens(text)
    
    dict_1 <- dictionary(list(fruits = c("[aä]pfel*")))
    dict_2 <- dictionary(list(fruits = c("[a-z]pfel*")))
    
    tokens_lookup(toks, dict_1, valuetype = "regex", exclusive = FALSE)
    ## Tokens consisting of 1 document.
    ## text1 :
    ## [1] "Ich"    "mag"    "FRUITS" "und"    "FRUITS"
    tokens_lookup(toks, dict_2, valuetype = "regex", exclusive = FALSE)
    ## Tokens consisting of 1 document.
    ## text1 :
    ## [1] "Ich"    "mag"    "Äpfel"  "und"    "FRUITS"
    

    Note: There is no need to load all of the tidyverse just to get %>%, as quanteda (at least in the version used here, 2.0.1) makes it available through re-export.
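
As a complement to the answer above, here is a minimal sketch of three ways to catch both spellings. It assumes the same quanteda setup as in the answer; the object names (dict_glob, dict_anchored, dict_letter and the res_* results) are made up for illustration. Two details of the regex version are worth keeping in mind: in a regex, the trailing * in "[aä]pfel*" means "zero or more l" rather than "anything", and a regex can match anywhere inside a token unless it is anchored with ^. Passing tolower = FALSE to dictionary() is a precaution so that the \p{L} class (any single Unicode letter, umlauts included) is stored exactly as written.

```r
library("quanteda")

toks <- tokens("Ich mag Äpfel und Apfelkuchen")

# Glob: no character classes available, so list both spellings explicitly.
dict_glob <- dictionary(list(fruits = c("apfel*", "äpfel*")))

# Regex: "^" anchors the match at the start of the token,
# which mirrors what the glob pattern "apfel*" expresses.
dict_anchored <- dictionary(list(fruits = "^[aä]pfel"), tolower = FALSE)

# Regex: \p{L} stands for any single Unicode letter, so it also covers umlauts.
dict_letter <- dictionary(list(fruits = "^\\p{L}pfel"), tolower = FALSE)

res_glob <- tokens_lookup(toks, dict_glob,
                          valuetype = "glob", exclusive = FALSE)
res_anchored <- tokens_lookup(toks, dict_anchored,
                              valuetype = "regex", exclusive = FALSE)
res_letter <- tokens_lookup(toks, dict_letter,
                            valuetype = "regex", exclusive = FALSE)

print(res_glob)
print(res_anchored)
print(res_letter)
```

If this behaves as intended, all three lookups should tag both "Äpfel" and "Apfelkuchen" as FRUITS. Note that the \p{L} variant is deliberately broad: it accepts any letter before "pfel", so it is closer in spirit to the [a-z] attempt from the question than to [aä].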