
Detecting and removing multibyte strings in R


I have a multibyte string, "UCA1\xa6\xc1", in a large vector of RNA names; cat() prints it as UCA1��. I am trying to screen the vector for such strings and rename them to something else or, if all else fails, remove them from the vector, because I cannot capitalize them with functions like toupper().

I'm not sure which encoding the bytes '\xa6' and '\xc1' belong to, so I don't know how to screen for them with a regex. Could anybody help me with this?


Solution

  • This is probably an encoding issue, so try changing the encoding when you load the data. Try something like this:

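    # fileEncoding re-encodes the input on read; encoding = only marks
    # strings as Latin-1 or UTF-8 without converting them.
    df <- read.csv(file_path,
                   fileEncoding = "ISO-8859-1",  # try other encodings if this one does not fit your file
                   header = TRUE,
                   stringsAsFactors = FALSE)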
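
If re-reading the file with a different encoding is not an option, the vector can also be screened after loading. The following is a minimal sketch, not a definitive fix: it assumes the rest of the vector is valid UTF-8 or plain ASCII, and `rna_names` is a made-up example vector standing in for the real data. It uses base R's validUTF8() to flag the problem entries and iconv() with sub = "" to strip the bytes that cannot be interpreted.

```r
# Hypothetical example vector; "UCA1\xa6\xc1" is the problem entry from the question.
rna_names <- c("malat1", "UCA1\xa6\xc1", "hotair")

# validUTF8() returns FALSE for elements that are not valid UTF-8 byte sequences.
bad <- !validUTF8(rna_names)

# Option 1: strip the bytes that cannot be interpreted as UTF-8.
cleaned <- iconv(rna_names, from = "UTF-8", to = "UTF-8", sub = "")

# Option 2: drop the affected entries from the vector entirely.
kept <- rna_names[!bad]

# toupper() now works on the cleaned vector.
upper <- toupper(cleaned)
```

Use `which(bad)` to inspect the flagged entries before deciding whether to strip the stray bytes, rename the entries by hand, or drop them.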