Tags: r, regex, quanteda, quotation-marks

In Quanteda, how can we match quotation marks literally?


I'm trying to match quotation marks in a sentence token using Quanteda's tokens_lookup() function with valuetype="regex". Based on the information provided here on the regex flavor Quanteda uses, I thought \Q ... \E would be the way to go, but that didn't do the trick.

library(quanteda) 
# package version: 1.5.2

text <- c("text „some quoted text“ more text", "text « some quoted text » more text")

dict <- dictionary(list(MY_KEY = c("\Q*\E")))
# Error: '\Q' is an unrecognized escape in character string starting ""\Q"

I also tried to match the quotation mark directly with "“", which at least seems to be a legal regex pattern, but that didn't work either. Nor did variations of \Q...\E with double backslashes, as they are used for word boundaries, for instance (\\b).

So the more general question, I believe, is whether the regular expressions mentioned here are compatible with what Quanteda understands as valuetype="regex".

EDIT:

This worked for the first string, yet not for the second.

dict <- dictionary(list(MY_KEY = c(".\".")))

Solution

  • Regular expressions in quanteda are built on the stringi package, which supports Unicode character categories. You can retrieve all of your quotes by using these categories in a search pattern:

    • Ps, Pe - punctuation, open and close
    • Pi, Pf - punctuation, initial and final quote

    I included all four, since, for example, „ is in Ps but not Pi, and « is in Pi but not Ps.

    Further details are here.
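
    To see why all four categories are needed, one can ask stringi directly which category each mark falls in. This is only a minimal sketch, assuming the stringi package is installed; the names `quotes`, `cats` and `membership` are illustrative and not part of the original answer. The marks are written as `\u` escapes so the snippet does not depend on the file encoding.

    ```r
    library(stringi)

    # Quotation marks from the example texts, plus the straight ASCII quote
    quotes <- c(
      low9_double     = "\u201e",  # „
      left_double     = "\u201c",  # “
      left_guillemet  = "\u00ab",  # «
      right_guillemet = "\u00bb",  # »
      ascii_double    = "\""       # "
    )
    cats <- c("Ps", "Pe", "Pi", "Pf")

    # One column per category; TRUE where the mark belongs to that category
    membership <- sapply(cats, function(category) {
      stri_detect_regex(quotes, paste0("\\p{", category, "}"))
    })
    rownames(membership) <- names(quotes)
    print(membership)
    ```

    Note that the straight ASCII quote belongs to Po (other punctuation), so none of the four categories matches it. If straight quotes should be caught as well, they would have to be added to the character class explicitly.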

    library("quanteda")
    ## Package version: 2.0.1
    
    text <- c(
      "text „some quoted text“ more text",
      "text « some quoted text » more text"
    )
    toks <- tokens(text)
    
    tokens_select(toks, "[\\p{Pf}\\p{Pi}\\p{Ps}\\p{Pe}]", valuetype = "regex")
    ## Tokens consisting of 2 documents.
    ## text1 :
    ## [1] "„"
    ## 
    ## text2 :
    ## [1] "«" "»"
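
    Since the question asked about tokens_lookup() rather than tokens_select(), the same character class can also be placed in a dictionary. The following is a sketch rather than a tested recipe: it assumes a quanteda 2.x installation, reuses the `MY_KEY` name from the question, and passes `tolower = FALSE` (an assumption on my part about what is safest) so that the pattern is stored exactly as written.

    ```r
    library(quanteda)

    # Same example texts, with the quotation marks written as Unicode escapes
    text <- c(
      "text \u201esome quoted text\u201c more text",
      "text \u00ab some quoted text \u00bb more text"
    )
    toks <- tokens(text)

    # Dictionary whose single value is the Unicode-category character class;
    # tolower = FALSE keeps the regex untouched
    dict <- dictionary(
      list(MY_KEY = "[\\p{Pf}\\p{Pi}\\p{Ps}\\p{Pe}]"),
      tolower = FALSE
    )

    # Matching tokens are replaced by the key; all other tokens are dropped
    matched <- tokens_lookup(toks, dict, valuetype = "regex")
    print(matched)
    ```

    Note the double backslashes: R parses the string literal first, so a single backslash before `p` (or before `Q`, as in the question) is rejected as an unrecognized escape before the regex engine ever sees it.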