I'm not really sure how to turn this into a reproducible example, and for that I apologize. I have a data frame with a string column, and when I run stri_enc_mark
on that column I see a mix of 'ASCII'- and 'UTF-8'-marked strings. This is a problem because when I try to upload the data into an Elasticsearch database, I run into the following error:
"Invalid UTF-8 start byte 0xa0\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@40d00701; line: 1, column: 1425]"
I'm assuming this is because of the ASCII-encoded strings. I tried write.csv(..., fileEncoding = 'UTF-8')
, but when I load that CSV back up, the string column still has a mix of encodings. Neither Encoding(x) <- 'UTF-8'
, stri_enc_toutf8
, nor stri_encode
seems to help with the conversion.
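Roughly, the attempts looked like this (again with 'text' standing in for the real column name):

library(stringi)

write.csv(df, 'out.csv', fileEncoding = 'UTF-8')  # re-reading still shows mixed marks
Encoding(df$text) <- 'UTF-8'                      # only relabels the declared mark
df$text <- stri_enc_toutf8(df$text)
df$text <- stri_encode(df$text, to = 'UTF-8')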
Any advice or guidance would be awesome.
Thanks to @MrFlick I was able to solve the problem. Essentially, given a data frame with character columns of mixed encodings, the easiest workaround was:
library(dplyr)

df %>%
  mutate_if(is.character, function(x) {
    # Round-trip every string through its raw bytes: the bytes are left
    # untouched, but the declared encoding mark is dropped.
    sapply(x, function(y) {
      y %>%
        charToRaw() %>%
        rawToChar()
    }, USE.NAMES = FALSE)
  })
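If you're on dplyr 1.0 or later, where mutate_if is superseded, the same round trip written with across() would look something like:

library(dplyr)

df %>%
  mutate(across(where(is.character), function(x) {
    vapply(x, function(y) rawToChar(charToRaw(y)), character(1), USE.NAMES = FALSE)
  }))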
Either way, the round trip through raw bytes leaves the bytes themselves untouched but drops each string's declared encoding mark, so all the characters end up treated uniformly in the native
encoding. That resolved the encoding inconsistencies that were keeping me from loading the data into Elasticsearch.
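As a quick sanity check (again with 'text' standing in for the real column), re-running the mark inspection afterwards should show no stray 'UTF-8' marks, only 'ASCII' for pure-ASCII values and 'native' for everything else:

library(stringi)

table(stri_enc_mark(df$text))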