I seem to have encountered an enigmatic character in R that breaks my code. I am using R, version 4.2.3.
Take the two strings a and b:
a
[1] "Actinomyces naeslundii"
b
[1] "Actinomyces naeslundii"
Despite appearances, a and b are not identical.
a==b
[1] FALSE
Consistently, a does not match b:
grepl(a,b)
[1] FALSE
Interestingly, not all characters are identical between a and b:
strsplit(a, "")[[1]]
[1] "A" "c" "t" "i" "n" "o" "m" "y" "c" "e" "s" " " "n" "a" "e" "s" "l" "u" "n" "d" "i" "i"
strsplit(b, "")[[1]]
[1] "A" "c" "t" "i" "n" "o" "m" "y" "c" "e" "s" " " "n" "a" "e" "s" "l" "u" "n" "d" "i" "i"
strsplit(a, "")[[1]] == strsplit(b, "")[[1]]
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[21] TRUE TRUE
Character #12 is different. It looks like an innocent whitespace, only it isn't:
strsplit(a, "")[[1]][12]
[1] " "
strsplit(b, "")[[1]][12]
[1] " "
strsplit(a, "")[[1]][12] == strsplit(b, "")[[1]][12]
[1] FALSE
" " == strsplit(a, "")[[1]][12]
[1] TRUE
" " == strsplit(b, "")[[1]][12]
[1] FALSE
grepl("\\s", strsplit(a, "")[[1]][12])
[1] TRUE
grepl("\\s", strsplit(b, "")[[1]][12])
[1] FALSE
Using dput:
dput(a)
"Actinomyces naeslundii"
dput(b)
"Actinomyces naeslundii"
dput(a, file = "a.dput")
dput(b, file = "b.dput")
The generated files differ by one byte:
$ ls -lah *dput
-rw-r--r-- 1 johannes johannes 25 May 16 20:23 a.dput
-rw-r--r-- 1 johannes johannes 26 May 16 20:23 b.dput
Have you encountered this character? What could it be? How can I search for it in my data frames?
Thanks to the useful comments, I am now in a position to solve the above mystery.
There are at least two modifications that render string b identical to a.
Replacing the Unicode character \U00A0 (the no-break space) with a regular space (" "):
> b.mod <- gsub("\U00A0", " ", b)
> b.mod == a
[1] TRUE
Replacing horizontal whitespace \h with a regular space (" ") using packages stringr or stringi:
> b.mod1 <- stringi::stri_replace_all(b, " ", regex = "\\h")
> b.mod1 == a
[1] TRUE
> b.mod2 <- stringr::str_replace_all(b, "\\h", " ")
> b.mod2 == a
[1] TRUE
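To see why both modifications work, it may help to look at the offending character directly. The following is a minimal sketch that assumes b can be reconstructed with a no-break space ("\u00a0") at position 12; the variable names cp_a and cp_b are my own. It uses only base R (utf8ToInt, charToRaw, nchar).

```r
# Assumed reconstruction of the two strings from the question:
# a has a regular space, b has a no-break space at position 12.
a <- "Actinomyces naeslundii"
b <- "Actinomyces\u00a0naeslundii"

# Unicode code point of the 12th character in each string.
cp_a <- utf8ToInt(substr(a, 12, 12))  # 32  (U+0020, regular space)
cp_b <- utf8ToInt(substr(b, 12, 12))  # 160 (U+00A0, no-break space)

# In UTF-8 the regular space is one byte (20), the no-break space two (c2 a0).
charToRaw(substr(a, 12, 12))
charToRaw(substr(b, 12, 12))

# Total byte lengths differ by exactly one.
nchar(a, type = "bytes")  # 22
nchar(b, type = "bytes")  # 23
```

The extra byte is consistent with the dput files above: 22 bytes plus two quotes and a newline gives 25 for a.dput, and 23 bytes plus the same three gives 26 for b.dput.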
Nevertheless, replacing \h or \s+ with a regular space (" ") does not work with function gsub from package base when it is called with its default regex engine, as here:
> b.mod3 <- gsub("\\h", " ", b)
> b.mod3 == a
[1] FALSE
> b.mod4 <- gsub("\\s+", " ", b)
> b.mod4 == a
[1] FALSE
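A possible base-only alternative, which I have not shown above, is to switch gsub to its PCRE engine with perl = TRUE, where \h is meant to cover horizontal whitespace including U+00A0. This is a sketch under the assumption that b is a UTF-8 string containing a no-break space; b.mod5 is a name I made up for illustration.

```r
# Assumed reconstruction of the two strings from the question.
a <- "Actinomyces naeslundii"
b <- "Actinomyces\u00a0naeslundii"

# Same pattern as b.mod3, but using the PCRE engine instead of the default one.
b.mod5 <- gsub("\\h", " ", b, perl = TRUE)

# Compare against the clean string.
b.mod5 == a
```

If this behaves differently on a given platform or locale, the explicit gsub("\U00A0", " ", b) replacement from the first modification remains the safer choice.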
Again, thanks to all commenters!
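As for the remaining question of how to search data frames for this character, here is a minimal sketch. The data frame df, its columns, and the helper names nbsp, is_chr and hits are hypothetical; it matches the no-break space literally with fixed = TRUE, so it does not depend on any regex engine.

```r
# Hypothetical example data; only the character columns are inspected.
df <- data.frame(
  id = 1:3,
  species = c("Actinomyces naeslundii",
              "Actinomyces\u00a0naeslundii",
              "Rothia\u00a0dentocariosa"),
  stringsAsFactors = FALSE
)

nbsp <- "\u00a0"

# Which columns hold character data?
is_chr <- vapply(df, is.character, logical(1))

# Number of cells per character column that contain a no-break space.
hits <- vapply(df[is_chr],
               function(col) sum(grepl(nbsp, col, fixed = TRUE)),
               integer(1))

# Rows of one column that are affected.
which(grepl(nbsp, df$species, fixed = TRUE))

# Replace every no-break space with a regular space in all character columns.
df[is_chr] <- lapply(df[is_chr],
                     function(col) gsub(nbsp, " ", col, fixed = TRUE))
```

Counting first and replacing afterwards makes it possible to check how widespread the problem is before modifying the data.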