
R character that looks like whitespace but is not


I seem to have encountered an enigmatic character in R that breaks my code. I am using R, version 4.2.3.

Take the two strings a and b:

a
[1] "Actinomyces naeslundii"
b
[1] "Actinomyces naeslundii"

Despite appearances, a and b are not identical.

a==b
[1] FALSE

Consistent with this, a does not match b when used as a pattern:

grepl(a,b)
[1] FALSE

Interestingly, not all characters are identical between a and b:

strsplit(a, "")[[1]]
[1] "A" "c" "t" "i" "n" "o" "m" "y" "c" "e" "s" " " "n" "a" "e" "s" "l" "u" "n" "d" "i" "i"
strsplit(b, "")[[1]]
[1] "A" "c" "t" "i" "n" "o" "m" "y" "c" "e" "s" " " "n" "a" "e" "s" "l" "u" "n" "d" "i" "i"
strsplit(a, "")[[1]] == strsplit(b, "")[[1]]
[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[21]  TRUE  TRUE

Character #12 is different. It looks like an innocent whitespace, only it isn't:

strsplit(a, "")[[1]][12]
[1] " "
strsplit(b, "")[[1]][12]
[1] " "
strsplit(a, "")[[1]][12] == strsplit(b, "")[[1]][12]
[1] FALSE
" " == strsplit(a, "")[[1]][12]
[1] TRUE
" " == strsplit(b, "")[[1]][12]
[1] FALSE
grepl("\\s", strsplit(a, "")[[1]][12])
[1] TRUE
grepl("\\s", strsplit(b, "")[[1]][12])
[1] FALSE

Using dput:

dput(a)
"Actinomyces naeslundii"
dput(b)
"Actinomyces naeslundii"
dput(a, file = "a.dput")
dput(b, file = "b.dput")

The generated files differ by one byte:

$ ls -lah *dput
-rw-r--r-- 1 johannes johannes 25 May 16 20:23 a.dput
-rw-r--r-- 1 johannes johannes 26 May 16 20:23 b.dput

Have you encountered this character? What could it be? How can I search for it in my data frames?


Solution

  • Thanks to the useful comments, I am now in a position to solve the mystery above.

    There are at least two modifications that render string b identical to a.

    Replacing the Unicode character \U00A0 (no-break space) with an ordinary space (" "):

    > b.mod <- gsub("\U00A0", " ", b)
    > b.mod == a
    [1] TRUE
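
    One way to confirm what the character is would be to look at its code point and raw bytes. The sketch below rebuilds a and b by hand, which is an assumption, since the original objects came from my data. If b really contains U+00A0, its UTF-8 encoding takes two bytes, which would also be consistent with the one-byte size difference between a.dput and b.dput.

    ```r
    # Rebuilt by hand for illustration; "\u00a0" is the no-break space (assumption)
    a <- "Actinomyces naeslundii"
    b <- "Actinomyces\u00a0naeslundii"

    # Code point of the 12th character in each string
    cp_a <- utf8ToInt(substr(a, 12, 12))  # ordinary space, U+0020
    cp_b <- utf8ToInt(substr(b, 12, 12))  # no-break space, U+00A0
    print(c(a = cp_a, b = cp_b))

    # Raw UTF-8 bytes of the suspicious character
    bytes_b <- charToRaw(substr(b, 12, 12))
    print(bytes_b)
    ```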
    

    Replacing horizontal whitespace (\h) with an ordinary space (" ") using the stringi or stringr package:

    > b.mod1 <- stringi::stri_replace_all(b, " ", regex = "\\h")
    > b.mod1 == a
    [1] TRUE
    > b.mod2 <- stringr::str_replace_all(b, "\\h", " ")
    > b.mod2 == a
    [1] TRUE
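
    As for searching data frames, a minimal sketch could look like the following. The data frame df and its columns are hypothetical and only serve as an illustration. The idea is to restrict the search to character columns and to match the no-break space literally with fixed = TRUE, so that no regex engine is involved.

    ```r
    # Hypothetical data frame for illustration (not from the original data)
    df <- data.frame(
      species = c("Actinomyces naeslundii", "Actinomyces\u00a0naeslundii"),
      count   = c(3L, 5L),
      stringsAsFactors = FALSE
    )

    # Names of the character columns
    chr_cols <- names(df)[vapply(df, is.character, logical(1))]

    # For each character column, the row indices that contain a no-break space
    hits <- lapply(df[chr_cols], function(col) {
      which(grepl("\u00a0", col, fixed = TRUE))
    })
    print(hits)

    # Replace the no-break space with an ordinary space in those columns
    df[chr_cols] <- lapply(df[chr_cols], function(col) {
      gsub("\u00a0", " ", col, fixed = TRUE)
    })
    ```

    Factor columns would need to be converted with as.character first, or handled via their levels.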
    

    In contrast, replacing \h or \s+ with an ordinary space (" ") does not work with gsub from package base when it is called with its default settings (perl = FALSE), as used here:

    > b.mod3 <- gsub("\\h", " ", b)
    > b.mod3 == a
    [1] FALSE
    > b.mod4 <- gsub("\\s+", " ", b)
    > b.mod4 == a
    [1] FALSE
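
    A possible workaround in base R, sketched here under the assumption that b contains U+00A0 and is a UTF-8 string, is to switch gsub to the PCRE engine with perl = TRUE, where \h is documented to include the no-break space. A literal replacement with fixed = TRUE sidesteps regex engine differences altogether. The names b.mod5 and b.mod6 are only illustrative.

    ```r
    # Rebuilt by hand for illustration; "\u00a0" is the no-break space (assumption)
    a <- "Actinomyces naeslundii"
    b <- "Actinomyces\u00a0naeslundii"

    # PCRE engine: \h (horizontal whitespace) is expected to cover U+00A0
    b.mod5 <- gsub("\\h", " ", b, perl = TRUE)
    print(b.mod5 == a)

    # Literal match of the no-break space, no regex involved
    b.mod6 <- gsub("\u00a0", " ", b, fixed = TRUE)
    print(b.mod6 == a)
    ```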
    

    Again, thanks to all commenters!