r, data-cleaning, stringi

Removing non-ASCII characters and then lowercasing the text gives an error


I have a big data set that I cleaned up, and I found that one of the fields has a value like this:

"My son is turning into a monster \xf0\u009f\u0098\u0092"

I cannot even recreate this simple value, because R gives the error shown below:

a <- c('My son is turning into a monster \xf0\u009f\u0098\u0092')

Error: mixing Unicode and octal/hex escapes in a string is not allowed

Now suppose I have this value in my variable and want to remove all non-ASCII characters, like this:

library(stringi)
b <- stri_trans_general(a, "latin-ascii")

and then I want to convert the text to lower case:

tolower(b)

I get the following error:

Error in tolower(b) : invalid input 'My son is turning into a monster 😒' in 'utf8towcs'

Can someone please help me understand the issue?


Solution

  • To remove all non-ASCII characters you can use a regex. [\x00-\x7F] is the set of all ASCII characters, so the negated class [^\x00-\x7F] matches every non-ASCII character, and we can replace each match with nothing. However, R doesn't like \x00 because it's the null character, so I had to change the range to [\x01-\x7F].

    a <- c('My son is turning into a monster \u009f\u0098\u0092')
    #> [1] "My son is turning into a monster \u009f\u0098\u0092"
    tolower(gsub('[^\x01-\x7F]+','',a))
    #> [1] "my son is turning into a monster "
    

    or, with the hex escape

    a <- c('My son is turning into a monster \xf0')
    #> [1] "My son is turning into a monster ð"
    tolower(gsub('[^\x01-\x7F]+','',a))
    #> [1] "my son is turning into a monster "
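
  • As a possible alternative to the regex above, here is a minimal sketch using base R's iconv(). It assumes the bytes \xf0\x9f\x98\x92 in the original value are the UTF-8 encoding of the emoji shown in the tolower() error (U+1F612), so the string can be written with hex escapes only, which avoids the "mixing Unicode and octal/hex escapes" error. The "latin-ascii" transform in the question only transliterates Latin characters that have ASCII equivalents, so it would leave the emoji in place. The exact substitution behaviour of iconv() can vary between platforms, so treat this as an illustration rather than a guaranteed result.

```r
# Build the value with hex escapes only (no \u escapes in the same string).
# Assumption: these four bytes are the UTF-8 encoding of U+1F612.
a <- "My son is turning into a monster \xf0\x9f\x98\x92"
Encoding(a) <- "UTF-8"  # declare that the bytes are UTF-8

# sub = "" replaces every byte that cannot be converted to ASCII with nothing.
b <- iconv(a, from = "UTF-8", to = "ASCII", sub = "")

# The string is now plain ASCII, so tolower() has nothing invalid to trip over.
result <- tolower(trimws(b))
print(result)
```

    trimws() is only there to drop the trailing space left behind once the emoji is removed; leave it out if the whitespace matters.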