I have a large data set. After cleaning it up, I found that one of the fields contains values like
"My son is turning into a monster \xf0\u009f\u0098\u0092"
I cannot even recreate this value as a simple example, because R gives the following error:
a <- c('My son is turning into a monster \xf0\u009f\u0098\u0092')
Error: mixing Unicode and octal/hex escapes in a string is not allowed
Now suppose I have this value in a variable and want to remove all non-ASCII characters, like this:
library(stringi)
b <- stri_trans_general(a, "latin-ascii")
and then I want to convert the text to lower case:
tolower(b)
but I get the following error:
Error in tolower(b) : invalid input 'My son is turning into a monster 😒' in 'utf8towcs'
Can someone please help me understand the issue?
To remove all non-ASCII characters you can use a regex. [\x00-\x7F]
is the set of all ASCII characters, so its negation, [^\x00-\x7F], matches every non-ASCII character, and we can replace every match with nothing. However, R doesn't accept \x00 in a string
because it's the null character, so I had to change the range to [^\x01-\x7F]
a <- c('My son is turning into a monster \u009f\u0098\u0092')
#> [1] "My son is turning into a monster \u009f\u0098\u0092"
tolower(gsub('[^\x01-\x7F]+','',a))
#> [1] "my son is turning into a monster "
or, with the hex escape \xf0
a <- c('My son is turning into a monster \xf0')
#> [1] "My son is turning into a monster ð"
tolower(gsub('[^\x01-\x7F]+','',a))
#> [1] "my son is turning into a monster "