Tags: r, regex, encoding

Strings are identical (according to `base::identical`) and yet behave differently with `grepl` / `gsub`


Note: I can no longer reproduce this situation on my MacBook. It happened on Windows, and it may or may not have been fixed since. The discussion is interesting in any case, but I'm not sure it's still valid.

Related to: Convert upper case words to title case

Some code that uses strings fetched from a web page doesn't behave as I expect. You can reproduce the issue by running the following:

library(xml2)
library(magrittr)
x <- xml2::read_html("https://poesie.webnet.fr/lesgrandsclassiques/Authors/B") %>%
  gsub("^.*?<span>(Pierre-Jean de BÉRANGER)</span>.*$","\\1",.)
x # [1] "Pierre-Jean de BÉRANGER"

This string is identical to "Pierre-Jean de BÉRANGER" copied and pasted from the page source. However, the following behavior is very disturbing to me:

y <- "Pierre-Jean de BÉRANGER"
x == y  # TRUE
identical(x, y) # TRUE
gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE) # [1] "Pierre-Jean de BÉRANGER"
gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", y, perl = TRUE) # [1] "Pierre-Jean de Béranger"
grepl("\\bB\\w+", x, perl = TRUE) # FALSE
grepl("\\bB\\w+", y, perl = TRUE) # TRUE
grepl("\\bB\\w", x, perl = TRUE)  # TRUE
grepl("\\bB\\w", y, perl = TRUE)  # TRUE

If x and y are identical, how can these calls give different output?

From ?identical:

The safe and reliable way to test two objects for being exactly equal


Edit:

Here's an observable difference:

Encoding(x) # "UTF-8"
Encoding(y) # "latin1"
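
The same situation can probably be sketched without fetching anything from the web. This is a minimal example, assuming `iconv()` can convert to latin1 on your platform; the shortened string and the way `y` is built are illustrative and are not the original data.

```r
# Same text, two different declared encodings (illustrative values).
x <- "B\u00c9RANGER"                           # \u escape: marked as UTF-8
y <- iconv(x, from = "UTF-8", to = "latin1")   # same text, marked as latin1

Encoding(x)        # "UTF-8"
Encoding(y)        # "latin1"
identical(x, y)    # TRUE

# The underlying bytes differ: "É" takes two bytes in UTF-8, one in latin1.
length(charToRaw(x))   # 9
length(charToRaw(y))   # 8
```

So `identical()` reports equality even though the byte representations differ, which is consistent with the `Encoding()` output above.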

I'm running R version 3.5.0 on Windows


Solution

  • If you check out the source of the identical() function, you can see that when it compares character vectors, each element (a CHARSXP) is passed to the internal helper function Seql(). That function translates the string values to UTF-8 before doing the comparison. Thus identical() isn't checking that the encodings are the same, just that the text represented in each encoding is the same.

    In a perfect world, the identical() function should have an ignore.encoding= option in addition to all the other properties you can ignore when doing a comparison.

    In theory, though, the strings should behave in the same way. So I guess you could blame the "perl" version of the regex engine here for not dealing with encodings properly. The base regex engine doesn't seem to have this problem:

    grepl("B\\w+", x)
    # [1] TRUE
    grepl("B\\w+", y)
    # [1] TRUE
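
    Since identical() has no such option, a stricter comparison can be sketched by hand. The helper name `identical_enc()` below is hypothetical (it is not part of base R), and the test strings are illustrative stand-ins for the ones in the question.

    ```r
    # Hypothetical helper: identical() plus a check that the declared encodings match.
    identical_enc <- function(a, b) {
      identical(a, b) && identical(Encoding(a), Encoding(b))
    }

    x <- "B\u00c9RANGER"                           # marked as UTF-8
    y <- iconv(x, from = "UTF-8", to = "latin1")   # same text, marked as latin1

    identical(x, y)                  # TRUE
    identical_enc(x, y)              # FALSE
    identical_enc(x, enc2utf8(y))    # TRUE
    ```

    Converting both inputs with enc2utf8() at least removes the encoding mismatch as a variable. Whether a given regex engine then treats accented letters as word characters is a separate question that this sketch does not settle.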