I have a corrupted file in which Windows special characters have been replaced by their UTF-8 "equivalents". I tried to write a function that replaces the special characters based on this table:
utf2win <- function(x){
  # soll: the characters as they should appear
  soll <- c("À", "Á", "Â", "Ã", "Ä", "Å", "Æ", "Ç", "È", "É", "Ê", "Ë",
            "Ì", "Í", "Î", "Ï", "Ð", "Ñ", "Ò", "Ó", "Ô", "Õ", "Ö", "×", "Ø",
            "Ù", "Ú", "Û", "Ü", "Ý", "Þ", "ß", "à", "á", "â", "ã", "ä", "å",
            "æ", "ç", "è", "é", "ê", "ë", "ì", "í", "î", "ï", "ð", "ñ", "ò",
            "ó", "ô", "õ", "ö", "÷", "ø", "ù", "ú", "û", "ü", "ý", "þ", "ÿ"
  )
  # ist: the same characters as they currently appear in the corrupted file
  ist <- c("Ã€", "Ã", "Ã‚", "Ãƒ", "Ã„", "Ã…", "Ã†", "Ã‡", "Ãˆ", "Ã‰",
           "ÃŠ", "Ã‹", "ÃŒ", "Ã", "ÃŽ", "Ã", "Ã", "Ã‘", "Ã’", "Ã“", "Ã”",
           "Ã•", "Ã–", "Ã—", "Ã˜", "Ã™", "Ãš", "Ã›", "Ãœ", "Ã", "Ãž", "ÃŸ",
           "Ã", "Ã¡", "Ã¢", "Ã£", "Ã¤", "Ã¥", "Ã¦", "Ã§", "Ã¨", "Ã©", "Ãª",
           "Ã«", "Ã¬", "Ã­", "Ã®", "Ã¯", "Ã°", "Ã±", "Ã²", "Ã³", "Ã´", "Ãµ",
           "Ã¶", "Ã·", "Ã¸", "Ã¹", "Ãº", "Ã»", "Ã¼", "Ã½", "Ã¾", "Ã¿")
  # replace each corrupted sequence with its intended character
  for (i in seq_along(ist)) {
    x <- gsub(ist[i], soll[i], x)
  }
  return(x)
}
And now for a test
a <- "Geidorf: GrabengÃ¼rtel"
utf2win(a)
And nothing happens... I guess the issue is that the character "Ã" is not recognized properly. Do you have a solution for my problem?
This is an encoding problem. You may be able to fix it, but it's hard to know without the file. readBin is a good bet if you can't force the proper encoding. Here is a summary of what I found:
I tried iconv for the example string:
iconv(a, "UTF-8", "WINDOWS-1252")
#[1] "Geidorf: Grabengürtel"
And it works, but you are right that something is up with "Ã":
iconv("Geidorf: GrabengÃ¼rtel Ã", "UTF-8", "WINDOWS-1252")
#[1] NA
We can see which letters are problematic:
ist[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))]
#[1] "Ã" "Ã" "Ã" "Ã" "Ã" "Ã"
# corresponding characters
paste(soll[is.na(iconv(ist, "UTF-8", "WINDOWS-1252"))])
#[1] "Á" "Í" "Ï" "Ð" "Ý" "à"
The site you linked to has a relevant page, which spells out what the issue is:
Encoding Problem: Double Mis-Conversion
Symptom
With this particular double conversion, most characters display correctly. Only characters with a second UTF-8 byte of 0x81, 0x8D, 0x8F, 0x90, 0x9D fail. In Windows-1252, the following characters with the Unicode code points: U+00C1, U+00CD, U+00CF, U+00D0, and U+00DD will show the problem. If you look at the I18nQA Encoding Debug Table you can see that these characters in UTF-8 have second bytes ending in one of the Unassigned Windows code points.
Á Í Ï Ð Ý
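As a quick check of the quoted claim, the second UTF-8 byte of each of these five characters can be inspected directly. This is only a minimal sketch: the variable names are mine, and the characters are written as \u escapes so the result should not depend on the encoding of the script file.

```r
# The five characters named in the quote: Á Í Ï Ð Ý
chars <- c("\u00C1", "\u00CD", "\u00CF", "\u00D0", "\u00DD")

# enc2utf8() makes sure we look at UTF-8 bytes; charToRaw() exposes them.
# Each of these characters is two bytes long in UTF-8, so we take byte 2.
second_byte <- vapply(
  chars,
  function(ch) as.character(charToRaw(enc2utf8(ch))[2]),
  character(1),
  USE.NAMES = FALSE
)
second_byte
#[1] "81" "8d" "8f" "90" "9d"
```

These are exactly the byte values that the quote lists as unassigned in Windows-1252.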
"à" is a different case. You have mapped it to "Ã" when it should be "Ã\u00A0" or "Ã\xA0" or "Ã " (note that the space is not a normal space; it's a non-breaking space). So, fixing that in ist takes care of one letter.
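To see where the non-breaking space comes from, one can take the UTF-8 bytes of "à" and deliberately misread them as Windows-1252. This is a sketch under the assumption that your platform's iconv accepts the name "WINDOWS-1252" (the same name used above); the variable names are only illustrative.

```r
a_grave <- "\u00E0"                 # "à", written as an escape
utf8_bytes <- charToRaw(a_grave)    # the two bytes that UTF-8 uses for it
utf8_bytes
#[1] c3 a0

# Misread those two bytes as Windows-1252, which is what corrupted the file.
# iconv() accepts a list of raw vectors as input.
misread <- iconv(list(utf8_bytes), from = "WINDOWS-1252", to = "UTF-8")

# Code points of the result: 195 is U+00C3 ("Ã"), 160 is U+00A0 (no-break space)
utf8ToInt(misread)
#[1] 195 160
```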
As for the other characters ("Á", "Í", "Ï", "Ð", and "Ý"), as is, they are all mapped to the same string "Ã" in ist, so a substitution cannot tell them apart, and you'll never be able to do the appropriate substitutions as long as that's true.
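If the file was damaged exactly once in this way (UTF-8 bytes read as Windows-1252), a lookup table may not be needed at all: the damage can be reversed by converting back to Windows-1252 bytes and then declaring those bytes to be UTF-8. The following is a hedged sketch, not a tested solution for your file. fix_mojibake is a hypothetical helper name, it assumes your iconv knows "WINDOWS-1252", and validUTF8() needs R 3.3.0 or later. Strings whose second byte was lost come back as NA rather than as a guess.

```r
# Hypothetical helper: undo one round of "UTF-8 read as Windows-1252".
fix_mojibake <- function(x) {
  # Step 1: turn each mis-decoded character back into its original byte.
  bytes <- iconv(enc2utf8(x), from = "UTF-8", to = "WINDOWS-1252")

  # Step 2: keep only results whose bytes form valid UTF-8.
  # A lone "Ã" becomes the single byte 0xC3, which is an incomplete sequence.
  ok <- !is.na(bytes) & validUTF8(bytes)
  bytes[!ok] <- NA_character_

  # Step 3: declare the recovered bytes to be UTF-8 (no conversion happens here).
  Encoding(bytes) <- "UTF-8"
  bytes
}

# "GrabengÃ¼rtel", written with escapes so the script encoding does not matter
fixed <- fix_mojibake("Geidorf: Grabeng\u00C3\u00BCrtel")
fixed == "Geidorf: Grabeng\u00FCrtel"
#[1] TRUE

# The second byte is missing here, so the original letter cannot be recovered
fix_mojibake("\u00C3")
#[1] NA
```

Returning NA for the unrecoverable cases is a deliberate choice in this sketch: it flags the affected strings for manual repair instead of silently substituting the wrong letter.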