Encoding issues in R


As part of a larger workflow, I need to extract unique character strings from a data frame. Suppose I have this data frame:

dummy <- structure(list(location = c("RD WEST - ACOUSTIC RELEASE", "RD NORTH - ACOUSTIC RELEASE", 
"RD EAST - ACOUSTIC RELEASE", "R SOUTHWEST REEF", "RD NORTH - ACOUSTIC RELEASE", 
"RD EAST - ACOUSTIC RELEASE", "RD WEST - ACOUSTIC RELEASE"
)), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"
))

# A tibble: 7 × 1
  location                   
  <chr>                      
1 RD WEST - ACOUSTIC RELEASE 
2 RD NORTH - ACOUSTIC RELEASE
3 RD EAST - ACOUSTIC RELEASE 
4 R SOUTHWEST REEF           
5 RD NORTH - ACOUSTIC RELEASE
6 RD EAST - ACOUSTIC RELEASE 
7 RD WEST - ACOUSTIC RELEASE 

You would think that if I run

unique(dummy$location)

R would output only 4 unique character strings, but instead it returns all 7. After some digging I found that the culprit is the encoding. (No idea how or why this got messed up; I imported Excel files directly from Google Drive into R.)

Encoding(dummy$location) 
[1] "unknown" "unknown" "unknown" "unknown" "UTF-8"   "UTF-8"   "UTF-8"  

I tried converting the "unknown" encodings to "UTF-8", but none of my attempts seems to work. This is what I've tried:

dummy$location <- iconv(dummy$location, from = "unknown", to = "UTF-8")

Encoding(dummy$location) <- "UTF-8"

dummy$location <- enc2utf8(dummy$location)

library(stringi)
dummy$location <- stri_encode(dummy$location, "", "UTF-8")

Any insights and solutions are greatly appreciated.

Thanks

EDIT: In case this helps:

# Print each string followed by its raw bytes
for (i in seq_along(dummy$location)) {
  cat("String", i, ":", dummy$location[i], "\n")
  print(charToRaw(dummy$location[i]))
  cat("\n")
}

confirms that the bottom 3 character strings are not identical to the top 3:

String 1 : RD WEST - ACOUSTIC RELEASE 
 [1] 52 44 20 57 45 53 54 20 2d 20 41 43 4f 55 53 54 49 43 20 52 45 4c 45 41 53 45

String 2 : RD NORTH - ACOUSTIC RELEASE 
 [1] 52 44 20 4e 4f 52 54 48 20 2d 20 41 43 4f 55 53 54 49 43 20 52 45 4c 45 41 53 45

String 3 : RD EAST - ACOUSTIC RELEASE 
 [1] 52 44 20 45 41 53 54 20 2d 20 41 43 4f 55 53 54 49 43 20 52 45 4c 45 41 53 45

String 4 : R SOUTHWEST REEF 
 [1] 52 20 53 4f 55 54 48 57 45 53 54 20 52 45 45 46

String 5 : RD NORTH - ACOUSTIC RELEASE 
 [1] 52 44 c2 a0 4e 4f 52 54 48 20 2d c2 a0 41 43 4f 55 53 54 49 43 c2 a0 52 45 4c 45 41 53 45

String 6 : RD EAST - ACOUSTIC RELEASE 
 [1] 52 44 c2 a0 45 41 53 54 c2 a0 2d c2 a0 41 43 4f 55 53 54 49 43 c2 a0 52 45 4c 45 41 53 45

String 7 : RD WEST - ACOUSTIC RELEASE 
 [1] 52 44 c2 a0 57 45 53 54 c2 a0 2d c2 a0 41 43 4f 55 53 54 49 43 c2 a0 52 45 4c 45 41 53 45

EDIT2: The solution, based on Roland's suggestion:

install.packages("fedmatch")
library(fedmatch)
    
dummy$location <- clean_strings(dummy$location) # this sanitizes the strings
    
unique(dummy$location)
[1] "rd west acoustic release"  "rd north acoustic release" "rd east acoustic release" 
[4] "r southwest reef"
    
Encoding(dummy$location)
[1] "unknown" "unknown" "unknown" "unknown" "unknown" "unknown" "unknown"

I guess the "unknown" is okay then, as long as they're all the same.
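
On why the "unknown" marks would not go away: as far as I understand it, Encoding() reports "unknown" for any string made up of ASCII bytes only, and R never attaches a declared encoding to such a string, so there is nothing for iconv() or enc2utf8() to convert. A minimal sketch with two made-up strings (ascii and nbsp are illustration names, not from the data above):

```r
ascii <- "RD EAST - ACOUSTIC RELEASE"                      # ordinary spaces only
nbsp  <- "RD\u00a0EAST\u00a0-\u00a0ACOUSTIC\u00a0RELEASE"  # U+00A0 instead of spaces

# Only the string that contains non-ASCII bytes carries a UTF-8 mark
Encoding(c(ascii, nbsp))
# [1] "unknown" "UTF-8"

# Declaring an ASCII-only string as UTF-8 is silently ignored
marked <- ascii
Encoding(marked) <- "UTF-8"
Encoding(marked)
# [1] "unknown"

# The real difference is in the bytes, not in the encoding mark
identical(charToRaw(ascii), charToRaw(nbsp))
# [1] FALSE
```

So a mix of "unknown" and "UTF-8" is a symptom here (some strings contain non-ASCII characters), not the cause of the mismatch.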


Solution

  • The issue is that some of your strings contain no-break spaces (NBSP, U+00A0, bytes c2 a0 in UTF-8) where the others contain ordinary spaces (byte 20). We can see that by reconstructing the strings from your raw byte dump:

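    # Turn a vector of hex byte strings back into a character string
    # and display it with non-standard whitespace made visible
    view_string <- function(x) {
        strtoi(x, base = 16L) |>
            as.raw() |>
            rawToChar() |>
            stringr::str_view()
    }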
    
    east1 <- c("0x52", "0x44", "0x20", "0x45", "0x41", "0x53", "0x54", "0x20", "0x2d", "0x20", "0x41", "0x43", "0x4f", "0x55", "0x53", "0x54", "0x49", "0x43", "0x20", "0x52", "0x45", "0x4c", "0x45", "0x41", "0x53", "0x45")
    east2 <- c("0x52", "0x44", "0xc2", "0xa0", "0x45", "0x41", "0x53", "0x54", "0xc2", "0xa0", "0x2d", "0xc2", "0xa0", "0x41", "0x43", "0x4f", "0x55", "0x53", "0x54", "0x49", "0x43", "0xc2", "0xa0", "0x52", "0x45", "0x4c", "0x45", "0x41", "0x53", "0x45")
    view_string(east1)
    # [1] │ RD EAST - ACOUSTIC RELEASE
    view_string(east2)
    # [1] │ RD{\u00a0}EAST{\u00a0}-{\u00a0}ACOUSTIC{\u00a0}RELEASE
    

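    So these strings are not identical: they look the same when printed but differ in their bytes. You can fix this by replacing every match of the \h regex class with an ordinary space, using perl = TRUE. The PCRE docs note: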

    The sequences \h, \H, \v, and \V are features that were added to Perl at release 5.10

    These match horizontal and vertical whitespace characters, and their negations. \h matches 19 Unicode horizontal space characters, including NBSP:

    dummy$location  <- gsub("\\h", " ", dummy$location, perl = TRUE) 
    length(unique(dummy$location))
    # [1] 4
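
If you would rather check which elements are affected before changing anything, or replace only NBSP and leave other whitespace alone, something along these lines may help. This is a sketch on two made-up strings rather than your data; x, has_non_ascii and cleaned are illustration names.

```r
x <- c("RD EAST - ACOUSTIC RELEASE",
       "RD\u00a0EAST\u00a0-\u00a0ACOUSTIC\u00a0RELEASE")

# Flag elements that contain any code point outside the ASCII range
has_non_ascii <- vapply(
  x,
  function(s) any(utf8ToInt(s) > 127L),
  logical(1),
  USE.NAMES = FALSE
)
has_non_ascii

# Narrower alternative to "\\h": replace only U+00A0 with an ordinary space
cleaned <- gsub("\u00a0", " ", x, fixed = TRUE)
length(unique(cleaned))
```

The fixed = TRUE replacement touches nothing but NBSP, which is useful when tabs or other spacing characters in the data are meaningful. The \h version above is the broader choice when any horizontal whitespace should collapse to a plain space.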